1
Chen AA, Weinstein SM, Adebimpe A, Gur RC, Gur RE, Merikangas KR, Satterthwaite TD, Shinohara RT, Shou H. Similarity-based multimodal regression. Biostatistics 2024; 25:1122-1139. [PMID: 38058018 PMCID: PMC11471965 DOI: 10.1093/biostatistics/kxad033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/22/2022] [Revised: 10/07/2023] [Accepted: 11/06/2023] [Indexed: 12/08/2023]
Abstract
To better understand complex human phenotypes, large-scale studies have increasingly collected multiple data modalities across domains such as imaging, mobile health, and physical activity. The properties of each data type often differ substantially and require either separate analyses or extensive processing to obtain comparable features for a combined analysis. Multimodal data fusion enables certain analyses on matrix-valued and vector-valued data, but it generally cannot integrate modalities of different dimensions and data structures. For a single data modality, multivariate distance matrix regression provides a distance-based framework for regression accommodating a wide range of data types. However, no distance-based method exists to handle multiple complementary types of data. We propose a novel distance-based regression model, which we refer to as Similarity-based Multimodal Regression (SiMMR), that enables simultaneous regression of multiple modalities through their distance profiles. We demonstrate through simulation, imaging studies, and longitudinal mobile health analyses that our proposed method can detect associations between clinical variables and multimodal data of differing properties and dimensionalities, even with modest sample sizes. We perform experiments to evaluate several different test statistics and provide recommendations for applying our method across a broad range of scenarios.
Affiliation(s)
- Andrew A Chen
- Department of Public Health Sciences, Medical University of South Carolina, Charleston, SC 29425, USA
- Sarah M Weinstein
- Department of Epidemiology and Biostatistics, Temple University College of Public Health, Philadelphia, PA 19122, USA
- Azeez Adebimpe
- Penn Lifespan Informatics & Neuroimaging Center, Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, USA
- Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, USA
- Ruben C Gur
- Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, USA
- Lifespan Brain Institute Penn Medicine and CHOP, University of Pennsylvania, Philadelphia, PA 19104, USA
- Raquel E Gur
- Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, USA
- Lifespan Brain Institute Penn Medicine and CHOP, University of Pennsylvania, Philadelphia, PA 19104, USA
- Kathleen R Merikangas
- Genetic Epidemiology Research Branch, Intramural Research Program, National Institute of Mental Health, Bethesda, MD 20892, USA
- Theodore D Satterthwaite
- Penn Lifespan Informatics & Neuroimaging Center, Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, USA
- Department of Psychiatry, University of Pennsylvania, Philadelphia, PA 19104, USA
- Russell T Shinohara
- Penn Statistics in Imaging and Visualization Center, Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Center for Biomedical Image Computing and Analytics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Haochang Shou
- Penn Statistics in Imaging and Visualization Center, Department of Biostatistics, Epidemiology, and Informatics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Center for Biomedical Image Computing and Analytics, University of Pennsylvania, Philadelphia, PA 19104, USA
2
Yang H, Wang X, Zhang Z, Chen F, Cao H, Yan L, Gao X, Dong H, Cui Y. A high-dimensional omnibus test for set-based association analysis. Brief Bioinform 2024; 25:bbae456. [PMID: 39288231 PMCID: PMC11407446 DOI: 10.1093/bib/bbae456] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2023] [Revised: 08/21/2024] [Accepted: 09/03/2024] [Indexed: 09/19/2024]
Abstract
Set-based association analysis is a valuable tool in studying the etiology of complex diseases in genome-wide association studies, as it allows for the joint testing of variants in a region or group. Two common types of single nucleotide polymorphism (SNP)-disease functional models are recognized when evaluating the joint function of a set of SNPs: the cumulative weak signal model, in which multiple functional variants with small effects contribute to disease risk, and the dominating strong signal model, in which a few functional variants with large effects contribute to disease risk. However, existing methods have two main limitations that reduce their power. First, they typically consider only one disease-SNP association model, which can result in significant power loss if the model is misspecified. Second, they do not account for the high-dimensional nature of SNPs, leading to low power or high false-positive rates. In this study, we propose a solution to these challenges by using a high-dimensional inference procedure that involves simultaneously fitting many SNPs in a regression model. We also propose an omnibus testing procedure that employs a robust and powerful P-value combination method to enhance the power of SNP-set association. Our results from extensive simulation studies and a real data analysis demonstrate that our set-based high-dimensional inference strategy is both flexible and computationally efficient and can substantially improve the power of SNP-set association analysis. Application to a real dataset further demonstrates the utility of the testing strategy.
Affiliation(s)
- Haitao Yang
- Division of Health Statistics, School of Public Health, Hebei Medical University, 361 East Zhongshan Road, Shijiazhuang, Hebei 050017, P.R. China
- Hebei Key Laboratory of Environment and Human Health, 361 East Zhongshan Road, Shijiazhuang, Hebei 050017, P.R. China
- Hebei Key Laboratory of Forensic Medicine, 361 East Zhongshan Road, Shijiazhuang, Hebei 050017, P.R. China
- Xin Wang
- Division of Health Statistics, School of Public Health, Hebei Medical University, 361 East Zhongshan Road, Shijiazhuang, Hebei 050017, P.R. China
- Zechen Zhang
- Division of Health Statistics, School of Public Health, Hebei Medical University, 361 East Zhongshan Road, Shijiazhuang, Hebei 050017, P.R. China
- Hebei Key Laboratory of Environment and Human Health, 361 East Zhongshan Road, Shijiazhuang, Hebei 050017, P.R. China
- Fuzhao Chen
- Division of Health Statistics, School of Public Health, Hebei Medical University, 361 East Zhongshan Road, Shijiazhuang, Hebei 050017, P.R. China
- Hongyan Cao
- Department of Health Statistics, Shanxi Provincial Key Laboratory of Major Diseases Risk Assessment, School of Public Health; MOE Key Laboratory of Coal Environmental Pathogenicity and Prevention, Shanxi Medical University, No 56 Xinjian South Rd., Taiyuan, Shanxi 030001, P.R. China
- Lina Yan
- Division of Health Statistics, School of Public Health, Hebei Medical University, 361 East Zhongshan Road, Shijiazhuang, Hebei 050017, P.R. China
- Hebei Key Laboratory of Environment and Human Health, 361 East Zhongshan Road, Shijiazhuang, Hebei 050017, P.R. China
- Xia Gao
- Division of Health Statistics, School of Public Health, Hebei Medical University, 361 East Zhongshan Road, Shijiazhuang, Hebei 050017, P.R. China
- Hebei Key Laboratory of Environment and Human Health, 361 East Zhongshan Road, Shijiazhuang, Hebei 050017, P.R. China
- Hui Dong
- Department of Neurology, Second Hospital of Hebei Medical University, 215 West Heping Road, Shijiazhuang, Hebei 050000, P.R. China
- Yuehua Cui
- Department of Statistics and Probability, Michigan State University, 619 Red Cedar Rd., East Lansing, MI 48824, United States
3
Xu K, Wang Y, Jiang Y, Wang Y, Li P, Lu H, Suo C, Yuan Z, Yang Q, Dong Q, Jin L, Cui M, Chen X. Analysis of gait pattern related to high cerebral small vessel disease burden using quantitative gait data from wearable sensors. Comput Methods Programs Biomed 2024; 250:108162. [PMID: 38631129 DOI: 10.1016/j.cmpb.2024.108162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2022] [Revised: 03/28/2024] [Accepted: 04/03/2024] [Indexed: 04/19/2024]
Abstract
BACKGROUND AND OBJECTIVES: Sensor-based wearable devices help to obtain a wide range of quantitative gait parameters, providing sufficient data to investigate disease-specific gait patterns. Although cerebral small vessel disease (CSVD) plays a significant role in gait impairment, the specific gait pattern associated with a high burden of CSVD remains to be explored. METHODS: We analyzed the gait pattern related to high CSVD burden in 720 participants (aged 55-65 years, 42.5% male) free of neurological disease in the Taizhou Imaging Study. All participants underwent detailed quantitative gait assessments (obtained from an insole-like wearable gait tracking device) and brain magnetic resonance imaging examinations. Thirty-three gait parameters were summarized into five gait domains. Sparse sliced inverse regression was developed to extract the gait pattern related to high CSVD burden. RESULTS: The specific gait pattern derived from several gait domains (i.e., angles, phases, variability, and spatio-temporal) was significantly associated with the CSVD burden (OR=1.250, 95% CI: 1.011-1.546). The gait pattern indicates that people with a high CSVD burden tended to have smaller gait angles, more stance time, more double support time, larger gait variability, and slower gait velocity. Furthermore, people with this gait pattern had 25% higher odds of a high CSVD burden. CONCLUSIONS: We established a more stable and disease-specific quantitative gait pattern related to high CSVD burden, which may facilitate the identification of individuals with high CSVD burden among community residents or the general population.
Affiliation(s)
- Kelin Xu
- Department of Biostatistics, Ministry of Education Key Laboratory of Public Health Safety, School of Public Health, Fudan University, Shanghai, China
- Yingzhe Wang
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, and School of Life Sciences, Fudan University, Shanghai, China; Department of Neurology, Huashan Hospital, Fudan University, Shanghai, China
- Yanfeng Jiang
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, and School of Life Sciences, Fudan University, Shanghai, China; Fudan University Taizhou Institute of Health Sciences, Taizhou, China
- Yawen Wang
- Department of Biostatistics, Ministry of Education Key Laboratory of Public Health Safety, School of Public Health, Fudan University, Shanghai, China
- Peixi Li
- Department of Neurology, Huashan Hospital, Fudan University, Shanghai, China
- Heyang Lu
- Department of Neurology, Huashan Hospital, Fudan University, Shanghai, China
- Chen Suo
- Fudan University Taizhou Institute of Health Sciences, Taizhou, China; Department of Epidemiology, Ministry of Education Key Laboratory of Public Health Safety, School of Public Health, Fudan University, Shanghai, China
- Ziyu Yuan
- Fudan University Taizhou Institute of Health Sciences, Taizhou, China
- Qi Yang
- Department of Neurology, Huashan Hospital, Fudan University, Shanghai, China
- Qiang Dong
- Department of Neurology, Huashan Hospital, Fudan University, Shanghai, China
- Li Jin
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, and School of Life Sciences, Fudan University, Shanghai, China; Fudan University Taizhou Institute of Health Sciences, Taizhou, China
- Mei Cui
- Department of Neurology, Huashan Hospital, Fudan University, Shanghai, China
- Xingdong Chen
- State Key Laboratory of Genetic Engineering, Human Phenome Institute, and School of Life Sciences, Fudan University, Shanghai, China; Fudan University Taizhou Institute of Health Sciences, Taizhou, China
4
Deng Q, Song C, Lin S. An adaptive and robust method for multi-trait analysis of genome-wide association studies using summary statistics. Eur J Hum Genet 2024; 32:681-690. [PMID: 37237036 PMCID: PMC11153499 DOI: 10.1038/s41431-023-01389-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2022] [Revised: 05/01/2023] [Accepted: 05/10/2023] [Indexed: 05/28/2023]
Abstract
Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with human traits or diseases in the past decade. Nevertheless, much of the heritability of many traits is still unaccounted for. Commonly used single-trait analysis methods are conservative, while multi-trait methods improve statistical power by integrating association evidence across multiple traits. In contrast to individual-level data, GWAS summary statistics are usually publicly available, and thus methods using only summary statistics have greater usage. Although many methods have been developed for joint analysis of multiple traits using summary statistics, there are many issues, including inconsistent performance, computational inefficiency, and numerical problems when considering lots of traits. To address these challenges, we propose a multi-trait adaptive Fisher method for summary statistics (MTAFS), a computationally efficient method with robust power performance. We applied MTAFS to two sets of brain imaging derived phenotypes (IDPs) from the UK Biobank, including a set of 58 Volumetric IDPs and a set of 212 Area IDPs. Through annotation analysis, the underlying genes of the SNPs identified by MTAFS were found to exhibit higher expression and are significantly enriched in brain-related tissues. Together with results from a simulation study, MTAFS shows its advantage over existing multi-trait methods, with robust performance across a range of underlying settings. It controls type 1 error well and can efficiently handle a large number of traits.
Affiliation(s)
- Qiaolan Deng
- Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, USA
- Department of Statistics, College of Arts and Sciences, The Ohio State University, Columbus, OH, USA
- Chi Song
- Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, USA
- Shili Lin
- Department of Statistics, College of Arts and Sciences, The Ohio State University, Columbus, OH, USA
5
Wang L, Babushkin N, Liu Z, Liu X. Trans-eQTL mapping in gene sets identifies network effects of genetic variants. Cell Genom 2024; 4:100538. [PMID: 38565144 PMCID: PMC11019359 DOI: 10.1016/j.xgen.2024.100538] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2023] [Revised: 12/08/2023] [Accepted: 03/13/2024] [Indexed: 04/04/2024]
Abstract
Nearly all trait-associated variants identified in genome-wide association studies (GWASs) are noncoding. The cis regulatory effects of these variants have been extensively characterized, but how they affect gene regulation in trans has been the subject of fewer studies because of the difficulty in detecting trans-expression quantitative trait loci (trans-eQTLs). We developed trans-PCO for detecting trans effects of genetic variants on gene networks. Our simulations demonstrate that trans-PCO substantially outperforms existing trans-eQTL mapping methods. We applied trans-PCO to two gene expression datasets from whole blood, DGN (N = 913) and eQTLGen (N = 31,684), and identified 14,985 high-quality trans-eSNP-module pairs associated with 197 co-expression gene modules and biological processes. We performed colocalization analyses between GWAS loci of 46 complex traits and the trans-eQTLs. We demonstrated that the identified trans effects can help us understand how trait-associated variants affect gene regulatory networks and biological pathways.
Affiliation(s)
- Lili Wang
- The Committee on Genetics, Genomics and Systems Biology, University of Chicago, Chicago, IL 60637, USA; Department of Medicine, Section of Genetic Medicine, University of Chicago, Chicago, IL 60637, USA
- Nikita Babushkin
- Department of Medicine, Section of Genetic Medicine, University of Chicago, Chicago, IL 60637, USA
- Zhonghua Liu
- Department of Biostatistics, Columbia University, New York, NY 10032, USA
- Xuanyao Liu
- The Committee on Genetics, Genomics and Systems Biology, University of Chicago, Chicago, IL 60637, USA; Department of Medicine, Section of Genetic Medicine, University of Chicago, Chicago, IL 60637, USA; Department of Human Genetics, University of Chicago, Chicago, IL 60637, USA
6
Sun W, Jon K, Zhu W. Multiple phenotype association tests based on sliced inverse regression. BMC Bioinformatics 2024; 25:144. [PMID: 38575890 PMCID: PMC10996256 DOI: 10.1186/s12859-024-05731-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2023] [Accepted: 03/05/2024] [Indexed: 04/06/2024]
Abstract
BACKGROUND: Joint analysis of multiple phenotypes in studies of biological systems, such as genome-wide association studies, is critical to revealing the functional interactions between various traits and genetic variants, but the growth in data dimensionality has become a major obstacle to the widespread use of joint analysis. To handle the large number of variables, we consider the sliced inverse regression (SIR) method. Specifically, we propose a novel SIR-based association test that is robust and powerful in testing the association between multiple predictors and multiple outcomes. RESULTS: We conduct simulation studies in both low- and high-dimensional settings with various numbers of single-nucleotide polymorphisms and consider the correlation structure of traits. Simulation results show that the proposed method outperforms the existing methods. We also successfully apply our method to a genetic association study of the ADNI dataset. Both the simulation studies and real data analysis show that the SIR-based association test is valid and achieves a higher efficiency compared with its competitors. CONCLUSION: Several scenarios with low- and high-dimensional responses and genotypes are considered in this paper. Our SIR-based method controls the estimated type I error at the pre-specified level α.
Affiliation(s)
- Wenyuan Sun
- Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun, 130024, Jilin, China
- Department of Mathematics, College of Science, Yanbian University, Yanji, 133002, Jilin, China
- Kyongson Jon
- Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun, 130024, Jilin, China
- Faculty of Mathematics, Kim Il Sung University, Pyongyang, 999093, Democratic People's Republic of Korea
- Wensheng Zhu
- Key Laboratory for Applied Statistics of MOE, School of Mathematics and Statistics, Northeast Normal University, Changchun, 130024, Jilin, China
- School of Mathematical Sciences, Harbin Normal University, Harbin, 150025, Heilongjiang, China
7
Sun R, Shi A, Lin X. Differences in set-based tests for sparse alternatives when testing sets of outcomes compared to sets of explanatory factors in genetic association studies. Biostatistics 2023; 25:171-187. [PMID: 36000269 PMCID: PMC10724113 DOI: 10.1093/biostatistics/kxac036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2022] [Revised: 07/15/2022] [Accepted: 08/07/2022] [Indexed: 01/11/2023]
Abstract
Set-based association tests are widely popular in genetic association settings for their ability to aggregate weak signals and reduce multiple testing burdens. In particular, a class of set-based tests including the Higher Criticism, Berk-Jones, and other statistics have recently been popularized for reaching a so-called detection boundary when signals are rare and weak. Such tests have been applied in two subtly different settings: (a) associating a genetic variant set with a single phenotype and (b) associating a single genetic variant with a phenotype set. A significant issue in practice is the choice of test, especially when deciding between innovated and generalized type methods for detection boundary tests. Conflicting guidance is present in the literature. This work describes how correlation structures generate marked differences in relative operating characteristics for settings (a) and (b). The implications for study design are significant. We also develop novel power bounds that facilitate the aforementioned calculations and allow for analysis of individual testing settings. In more concrete terms, our investigation is motivated by translational expression quantitative trait loci (eQTL) studies in lung cancer. These studies involve both testing for groups of variants associated with a single gene expression (multiple explanatory factors) and testing whether a single variant is associated with a group of gene expressions (multiple outcomes). Results are supported by a collection of simulation studies and illustrated through lung cancer eQTL examples.
Affiliation(s)
- Ryan Sun
- Department of Biostatistics, University of Texas MD Anderson Cancer Center, 1515 Holcombe Boulevard, Houston, TX 77030, USA
- Andy Shi
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, MA 02215, USA
- Xihong Lin
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, 677 Huntington Avenue, Boston, MA 02215, USA
8
St-Pierre J, Oualkacha K. A copula-based set-variant association test for bivariate continuous, binary or mixed phenotypes. Int J Biostat 2023; 19:369-387. [PMID: 36279152 PMCID: PMC10644254 DOI: 10.1515/ijb-2022-0010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/20/2022] [Revised: 05/26/2022] [Accepted: 08/23/2022] [Indexed: 11/15/2022]
Abstract
In genome-wide association studies (GWAS), researchers often deal with dichotomous and non-normally distributed traits, or a mixture of discrete and continuous traits. However, most current region-based methods rely on multivariate linear mixed models (mvLMMs) and assume a multivariate normal distribution for the phenotypes of interest. Hence, these methods are not applicable to disease or non-normally distributed traits. Therefore, there is a need to develop unified and flexible methods to study association between a set of (possibly rare) genetic variants and non-normal multivariate phenotypes. Copulas are multivariate distribution functions with uniform margins on the [0, 1] interval, and they provide suitable models to deal with non-normality of errors in multivariate association studies. We propose a novel unified and flexible copula-based multivariate association test (CBMAT) for discovering association between a genetic region and a bivariate continuous, binary or mixed phenotype. We also derive a data-driven analytic p-value procedure for the proposed region-based score-type test. Through simulation studies, we demonstrate that CBMAT has well-controlled type I error rates and higher power to detect associations compared with other existing methods, for discrete and non-normally distributed traits. Finally, we apply CBMAT to detect the association between two genes located on chromosome 11 and several lipid levels measured on 1477 subjects from the ALSPAC study.
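The copula idea in this abstract (a joint distribution whose margins are uniform on [0, 1], onto which arbitrary marginal distributions can be attached) can be illustrated with a minimal sketch. This is not the CBMAT procedure: the Gaussian copula, the exponential and Bernoulli margins, the `prevalence` parameter, and every function name below are assumptions chosen only for illustration.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sample_gaussian_copula(n, rho, seed=0):
    """Draw n pairs (u, v) from a bivariate Gaussian copula.

    Each coordinate is marginally Uniform(0, 1); rho controls the
    dependence between the two coordinates. (Hypothetical helper.)
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        # Build a second standard normal with correlation rho to z1.
        z2 = rho * z1 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        # The probability integral transform gives uniform margins.
        pairs.append((normal_cdf(z1), normal_cdf(z2)))
    return pairs


def to_mixed_phenotype(pairs, prevalence=0.3):
    """Attach illustrative margins: one continuous trait, one binary trait."""
    out = []
    for u, v in pairs:
        # Inverse CDF of Exp(1); the clamp avoids log(0) in extreme tails.
        y_cont = -math.log(max(1.0 - u, 1e-300))
        # Bernoulli(prevalence) margin via thresholding the uniform.
        y_bin = 1 if v > 1.0 - prevalence else 0
        out.append((y_cont, y_bin))
    return out


pairs = sample_gaussian_copula(5000, rho=0.6, seed=1)
traits = to_mixed_phenotype(pairs, prevalence=0.3)
```

The dependence structure lives entirely in `sample_gaussian_copula`, while `to_mixed_phenotype` changes only the margins, which is why a copula can couple a skewed continuous trait with a binary one without assuming joint normality of the phenotypes.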
Affiliation(s)
- Julien St-Pierre
- Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montreal, QC, Canada
- Karim Oualkacha
- Département de Mathématiques, Université du Québec à Montréal, Montreal, QC, Canada
9
Zhao Y, Sun L. A stable and adaptive polygenic signal detection method based on repeated sample splitting. Can J Stat 2023. [DOI: 10.1002/cjs.11768] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/03/2023]
10
Chen W, Coombes BJ, Larson NB. Recent advances and challenges of rare variant association analysis in the biobank sequencing era. Front Genet 2022; 13:1014947. [PMID: 36276986 PMCID: PMC9582646 DOI: 10.3389/fgene.2022.1014947] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 08/09/2022] [Accepted: 09/22/2022] [Indexed: 12/04/2022]
Abstract
Causal variants for rare genetic diseases are often rare in the general population. Rare variants may also contribute to common complex traits and can have much larger per-allele effect sizes than common variants, although power to detect these associations can be limited. Sequencing costs have steadily declined with technological advancements, making it feasible to adopt whole-exome and whole-genome profiling for large biobank-scale sample sizes. These large amounts of sequencing data provide both opportunities and challenges for rare-variant association analysis. Herein, we review the basic concepts of rare-variant analysis methods, the current state-of-the-art methods that use variant annotations or external controls to improve statistical power, and particular challenges facing rare-variant analysis, such as accounting for population structure and extremely unbalanced case-control designs. We also review recent advances and challenges in rare-variant analysis for familial sequencing data and for more complex phenotypes such as survival data. Finally, we discuss other potential directions for further methodology investigation.
Affiliation(s)
- Wenan Chen
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, TN, United States
- *Correspondence: Wenan Chen; Brandon J. Coombes; Nicholas B. Larson
- Brandon J. Coombes
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States
- Nicholas B. Larson
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States
11
Long M, Li Z, Zhang W, Li Q. The Cauchy Combination Test under Arbitrary Dependence Structures. Am Stat 2022. [DOI: 10.1080/00031305.2022.2116109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/01/2022]
Affiliation(s)
- Mingya Long
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences
- Wei Zhang
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences
- Qizhai Li
- Academy of Mathematics and Systems Science, Chinese Academy of Sciences, University of Chinese Academy of Sciences
12
Wang J, Wang W, Li H. Sparse block signal detection and identification for shared cross-trait association analysis. Ann Appl Stat 2022. [DOI: 10.1214/21-aoas1523] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Jianqiao Wang
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
- Wanjie Wang
- Department of Statistics and Applied Probability, National University of Singapore
- Hongzhe Li
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
13
Fan CC, Loughnan R, Makowski C, Pecheva D, Chen CH, Hagler DJ, Thompson WK, Parker N, van der Meer D, Frei O, Andreassen OA, Dale AM. Multivariate genome-wide association study on tissue-sensitive diffusion metrics highlights pathways that shape the human brain. Nat Commun 2022; 13:2423. [PMID: 35505052 PMCID: PMC9065144 DOI: 10.1038/s41467-022-30110-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/09/2021] [Accepted: 04/12/2022] [Indexed: 11/12/2022]
Abstract
The molecular determinants of tissue composition of the human brain remain largely unknown. Recent genome-wide association studies (GWAS) on this topic have had limited success due to methodological constraints. Here, we apply advanced whole-brain analyses on multi-shell diffusion imaging data and multivariate GWAS to two large scale imaging genetic datasets (UK Biobank and the Adolescent Brain Cognitive Development study) to identify and validate genetic association signals. We discover 503 unique genetic loci that have impact on multiple regions of human brain. Among them, more than 79% are validated in either of two large-scale independent imaging datasets. Key molecular pathways involved in axonal growth, astrocyte-mediated neuroinflammation, and synaptogenesis during development are found to significantly impact the measured variations in tissue-specific imaging features. Our results shed new light on the biological determinants of brain tissue composition and their potential overlap with the genetic basis of neuropsychiatric disorders.
Affiliation(s)
- Chun Chieh Fan
- Population Neuroscience and Genetics Lab, University of California, San Diego, La Jolla, CA, USA; Center for Multimodal Imaging and Genetics, University of California, San Diego, La Jolla, CA, USA; Department of Radiology, School of Medicine, University of California, San Diego, La Jolla, CA, USA
- Robert Loughnan
- Department of Cognitive Science, University of California, San Diego, La Jolla, CA, USA
- Carolina Makowski
- Center for Multimodal Imaging and Genetics, University of California, San Diego, La Jolla, CA, USA; Department of Radiology, School of Medicine, University of California, San Diego, La Jolla, CA, USA
- Diliana Pecheva
- Center for Multimodal Imaging and Genetics, University of California, San Diego, La Jolla, CA, USA; Department of Radiology, School of Medicine, University of California, San Diego, La Jolla, CA, USA
- Chi-Hua Chen
- Department of Radiology, School of Medicine, University of California, San Diego, La Jolla, CA, USA
- Donald J Hagler
- Center for Multimodal Imaging and Genetics, University of California, San Diego, La Jolla, CA, USA; Department of Radiology, School of Medicine, University of California, San Diego, La Jolla, CA, USA
- Wesley K Thompson
- Population Neuroscience and Genetics Lab, University of California, San Diego, La Jolla, CA, USA; Department of Radiology, School of Medicine, University of California, San Diego, La Jolla, CA, USA
- Nadine Parker
- NORMENT Centre, Division of Mental Health and Addiction, Oslo University Hospital & Institute of Clinical Medicine, University of Oslo, Oslo, Norway
- Dennis van der Meer
- NORMENT Centre, Division of Mental Health and Addiction, Oslo University Hospital & Institute of Clinical Medicine, University of Oslo, Oslo, Norway; School of Mental Health and Neuroscience, Faculty of Health, Medicine and Life Sciences, Maastricht University, Maastricht, Netherlands
- Oleksandr Frei
- NORMENT Centre, Division of Mental Health and Addiction, Oslo University Hospital & Institute of Clinical Medicine, University of Oslo, Oslo, Norway; Centre for Bioinformatics, Department of Informatics, University of Oslo, Oslo, Norway
- Ole A Andreassen
- NORMENT Centre, Division of Mental Health and Addiction, Oslo University Hospital & Institute of Clinical Medicine, University of Oslo, Oslo, Norway
- Anders M Dale
- Center for Multimodal Imaging and Genetics, University of California, San Diego, La Jolla, CA, USA; Department of Radiology, School of Medicine, University of California, San Diego, La Jolla, CA, USA; Department of Cognitive Science, University of California, San Diego, La Jolla, CA, USA; Department of Neuroscience, University of California, San Diego, La Jolla, CA, USA
|
14
|
Liu Y, Chen H, Heine J, Lindstrom S, Turman C, Warner ET, Winham SJ, Vachon CM, Tamimi RM, Kraft P, Jiang X. A genome-wide association study of mammographic texture variation. Breast Cancer Res 2022; 24:76. [PMCID: PMC9639267 DOI: 10.1186/s13058-022-01570-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/13/2022] [Accepted: 10/26/2022] [Indexed: 11/09/2022] Open
Abstract
Background: Breast parenchymal texture features, including grayscale variation (V), capture the patterns of texture variation on a mammogram and are associated with breast cancer risk, independent of mammographic density (MD). However, our knowledge of the genetic basis of these texture features is limited. Methods: We conducted a genome-wide association study of V in 7040 European-ancestry women. V assessments were generated from digitized film mammograms. We used linear regression to test the single-nucleotide polymorphism (SNP)-phenotype associations, adjusting for age, body mass index (BMI), MD phenotypes, and the top four genetic principal components. We further calculated genetic correlations and performed SNP-set tests of V with MD, breast cancer risk, and other breast cancer risk factors. Results: We identified three genome-wide significant loci associated with V: rs138141444 (6q24.1) in ECT2L, rs79670367 (8q24.22) in LINC01591, and rs113174754 (12q22) near PGAM1P5. 6q24.1 and 8q24.22 have not previously been associated with MD phenotypes or breast cancer risk, while 12q22 is a known locus for both MD and breast cancer risk. Among known MD and breast cancer risk SNPs, we identified four variants that were associated with V at the Bonferroni-corrected thresholds accounting for the number of SNPs tested: rs335189 (5q23.2) in PRDM6, rs13256025 (8p21.2) in EBF2, rs11836164 (12p12.1) near SSPN, and rs17817449 (16q12.2) in FTO. We observed significant genetic correlations between V and mammographic dense area (rg = 0.79, P = 5.91 × 10^−5), percent density (rg = 0.73, P = 1.00 × 10^−4), and adult BMI (rg = −0.36, P = 3.88 × 10^−7). Additional significant relationships were observed for non-dense area (z = −4.14, P = 3.42 × 10^−5), estrogen receptor-positive breast cancer (z = 3.41, P = 6.41 × 10^−4), and childhood body fatness (z = −4.91, P = 9.05 × 10^−7) from the SNP-set tests.
Conclusions: These findings provide new insights into the genetic basis of mammographic texture variation and its associations with MD, breast cancer risk, and other breast cancer risk factors. Supplementary Information: The online version contains supplementary material available at 10.1186/s13058-022-01570-8.
Affiliation(s)
- Yuxi Liu
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA; Program in Genetic Epidemiology and Statistical Genetics, Harvard T.H. Chan School of Public Health, 655 Huntington Avenue, Building 2-249A, Boston, MA 02115, USA
| | - Hongjie Chen
- Department of Epidemiology, University of Washington, Seattle, WA, USA
| | - John Heine
- Division of Population Sciences, H. Lee Moffitt Cancer Center & Research Institute, Tampa, FL, USA
| | - Sara Lindstrom
- Department of Epidemiology, University of Washington, Seattle, WA, USA; Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, WA, USA
| | - Constance Turman
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Erica T. Warner
- Clinical and Translational Epidemiology Unit, Department of Medicine, Mongan Institute, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
| | - Stacey J. Winham
- Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, USA
| | - Celine M. Vachon
- Division of Epidemiology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA
| | - Rulla M. Tamimi
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA; Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
| | - Peter Kraft
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA; Program in Genetic Epidemiology and Statistical Genetics, Harvard T.H. Chan School of Public Health, 655 Huntington Avenue, Building 2-249A, Boston, MA 02115, USA; Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Xia Jiang
- Department of Clinical Neuroscience, Center for Molecular Medicine, Karolinska Institutet, Visionsgatan 18, 171 77 Solna, Stockholm, Sweden; West China School of Public Health and West China Fourth Hospital, Sichuan University, Chengdu, China
|
15
|
Bae YE, Wu L, Wu C. InTACT: An adaptive and powerful framework for joint-tissue transcriptome-wide association studies. Genet Epidemiol 2021; 45:848-859. [PMID: 34255882 PMCID: PMC8604767 DOI: 10.1002/gepi.22425] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 01/14/2021] [Revised: 06/22/2021] [Accepted: 06/24/2021] [Indexed: 11/05/2022]
Abstract
Transcriptome-wide association studies (TWAS) that integrate transcriptomic reference data and genome-wide association studies (GWAS) have successfully enhanced the discovery of candidate genes for many complex traits. However, existing methods may suffer from substantial power loss because they fail to effectively consider that expression of many genes tends to be consistent across tissues. Here we propose a computationally efficient testing method, referred to as Integrative Test for Associations via Cauchy Transformation (InTACT), that effectively combines information across multiple tissues and thus improves the power of identifying associated genes. Through simulation studies, we show that InTACT maintains high power while properly controlling Type 1 error rates. We applied InTACT to the largest GWAS of Alzheimer's disease (AD) to date and identified 227 genome-wide significant genes, of which 130 were not identified by the benchmark methods, TWAS and MultiXcan. Importantly, InTACT identified five novel loci for AD. We implemented InTACT in publicly available software, "InTACT."
Affiliation(s)
- Ye Eun Bae
- Department of Statistics, Florida State University
| | - Lang Wu
- Cancer Epidemiology Division, Population Sciences in the Pacific Program, University of Hawaii Cancer Center, University of Hawaii at Manoa
| | - Chong Wu
- Department of Statistics, Florida State University
|
16
|
Liu W, Xu Y, Wang A, Huang T, Liu Z. The eigen higher criticism and eigen Berk–Jones tests for multiple trait association studies based on GWAS summary statistics. Genet Epidemiol 2021; 46:89-104. [DOI: 10.1002/gepi.22439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2021] [Revised: 09/10/2021] [Accepted: 10/21/2021] [Indexed: 11/11/2022]
Affiliation(s)
- Wei Liu
- Department of Statistics and Actuarial Science The University of Hong Kong Hong Kong SAR China
- Department of Cell Biology and Genetics, School of Basic Medical Sciences Xi'an Jiaotong University Health Science Center Xi'an China
| | - Yuyang Xu
- Department of Statistics and Actuarial Science The University of Hong Kong Hong Kong SAR China
| | - Anqi Wang
- Department of Statistics and Actuarial Science The University of Hong Kong Hong Kong SAR China
| | - Tao Huang
- Department of Epidemiology and Biostatistics, School of Public Health Peking University Beijing China
- Institute for Artificial Intelligence, Center for Intelligent Public Health Peking University Beijing China
- Key Laboratory of Molecular Cardiovascular Diseases, Peking University Ministry of Education Beijing China
| | - Zhonghua Liu
- Department of Statistics and Actuarial Science The University of Hong Kong Hong Kong SAR China
|
17
|
Quantitative disease risk scores from EHR with applications to clinical risk stratification and genetic studies. NPJ Digit Med 2021; 4:116. [PMID: 34302027 PMCID: PMC8302667 DOI: 10.1038/s41746-021-00488-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/09/2021] [Accepted: 05/06/2021] [Indexed: 12/30/2022] Open
Abstract
Labeling clinical data from electronic health records (EHR) in health systems requires extensive human expert knowledge and painstaking review by clinicians. Furthermore, existing phenotyping algorithms are not uniformly applied across large datasets and can suffer from inconsistencies in case definitions across different algorithms. We describe here quantitative disease risk scores based on almost unsupervised methods that require minimal input from clinicians, can be applied to large datasets, and alleviate some of the main weaknesses of existing phenotyping algorithms. We show applications to phenotypic data on approximately 100,000 individuals in eMERGE, and focus on several complex diseases, including Chronic Kidney Disease, Coronary Artery Disease, Type 2 Diabetes, Heart Failure, and a few others. We demonstrate that relative to existing approaches, the proposed methods have higher prediction accuracy, can better identify phenotypic features relevant to the disease under consideration, can perform better at clinical risk stratification, and can identify undiagnosed cases based on phenotypic features available in the EHR. Using genetic data from the eMERGE-seq panel that includes sequencing data for 109 genes on 21,363 individuals from multiple ethnicities, we also show how the new quantitative disease risk scores help improve the power of genetic association studies relative to the standard use of disease phenotypes. The results demonstrate the effectiveness of quantitative disease risk scores derived from rich phenotypic EHR databases to provide a more meaningful characterization of clinical risk for diseases of interest beyond the prevalent binary (case-control) classification.
|
18
|
Sitlani CM, Baldassari AR, Highland HM, Hodonsky CJ, McKnight B, Avery CL. Comparison of adaptive multiple phenotype association tests using summary statistics in genome-wide association studies. Hum Mol Genet 2021; 30:1371-1383. [PMID: 33949650 PMCID: PMC8283209 DOI: 10.1093/hmg/ddab126] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/19/2021] [Revised: 04/26/2021] [Accepted: 04/27/2021] [Indexed: 12/15/2022] Open
Abstract
Genome-wide association studies have been successful in mapping loci for individual phenotypes, but few studies have comprehensively interrogated evidence of shared genetic effects across multiple phenotypes simultaneously. Statistical methods have been proposed for analyzing multiple phenotypes using summary statistics, which enables studies of shared genetic effects while avoiding challenges associated with individual-level data sharing. Adaptive tests have been developed to maintain power against multiple alternative hypotheses because the most powerful single-alternative test depends on the underlying structure of the associations between the multiple phenotypes and a single nucleotide polymorphism (SNP). Here we compare the performance of six such adaptive tests: two adaptive sum of powered scores (aSPU) tests, the unified score association test (metaUSAT), the adaptive test in a mixed-models framework (mixAda) and two principal-component-based adaptive tests (PCAQ and PCO). Our simulations highlight practical challenges that arise when multivariate distributions of phenotypes do not satisfy assumptions of multivariate normality. Previous reports in this context focus on low minor allele count (MAC) and omit the aSPU test, which relies less than other methods on asymptotic and distributional assumptions. When these assumptions are not satisfied, particularly when MAC is low and/or phenotype covariance matrices are singular or nearly singular, aSPU better preserves type I error, sometimes at the cost of decreased power. We illustrate this trade-off with multiple phenotype analyses of six quantitative electrocardiogram traits in the Population Architecture using Genomics and Epidemiology (PAGE) study.
Affiliation(s)
- Colleen M Sitlani
- Cardiovascular Health Research Unit, Department of Medicine, University of Washington, Seattle, WA 98101 USA
| | - Antoine R Baldassari
- Department of Epidemiology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27516 USA
| | - Heather M Highland
- Department of Epidemiology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27516 USA
| | - Chani J Hodonsky
- Center for Public Health Genomics, University of Virginia, Charlottesville, VA 22908 USA
| | - Barbara McKnight
- Department of Biostatistics, University of Washington, Seattle, WA 98195 USA
| | - Christy L Avery
- Department of Epidemiology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27516 USA
|
19
|
Zhao Y, Sun L. On set‐based association tests: Insights from a regression using summary statistics. Can J Stat 2020. [DOI: 10.1002/cjs.11584] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/09/2022]
Affiliation(s)
- Yanyan Zhao
- Department of Statistical Sciences University of Toronto Toronto M5S 3G3 Ontario Canada
| | - Lei Sun
- Department of Statistical Sciences University of Toronto Toronto M5S 3G3 Ontario Canada
- Division of Biostatistics, Dalla Lana School of Public Health University of Toronto Toronto M5T 3M7 Ontario Canada
|
20
|
Taeb A, Shah P, Chandrasekaran V. False discovery and its control in low rank estimation. J R Stat Soc Series B Stat Methodol 2020. [DOI: 10.1111/rssb.12387] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Affiliation(s)
- Armeen Taeb
- California Institute of Technology Pasadena USA
| | - Parikshit Shah
- Yahoo Research and Wisconsin Institutes for Discovery Madison USA
|
21
|
Bu D, Yang Q, Meng Z, Zhang S, Li Q. Truncated tests for combining evidence of summary statistics. Genet Epidemiol 2020; 44:687-701. [DOI: 10.1002/gepi.22330] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/10/2020] [Revised: 04/24/2020] [Accepted: 06/01/2020] [Indexed: 12/15/2022]
Affiliation(s)
- Deliang Bu
- School of Mathematical Sciences University of Chinese Academy of Sciences Beijing China
- Key Laboratory of Big Data Mining and Knowledge Management Chinese Academy of Sciences Beijing China
| | - Qinglong Yang
- School of Statistics and Mathematics Zhongnan University of Economics and Law Wuhan China
| | - Zhen Meng
- LSC, NCMIS, Academy of Mathematics and Systems Science Chinese Academy of Sciences Beijing China
| | - Sanguo Zhang
- School of Mathematical Sciences University of Chinese Academy of Sciences Beijing China
- Key Laboratory of Big Data Mining and Knowledge Management Chinese Academy of Sciences Beijing China
| | - Qizhai Li
- School of Mathematical Sciences University of Chinese Academy of Sciences Beijing China
- LSC, NCMIS, Academy of Mathematics and Systems Science Chinese Academy of Sciences Beijing China
|
22
|
Liu Z, Barnett I, Lin X. A comparison of principal component methods between multiple phenotype regression and multiple SNP regression in genetic association studies. Ann Appl Stat 2020; 14:433-451. [PMID: 37398898 PMCID: PMC10313330 DOI: 10.1214/19-aoas1312] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 12/13/2023]
Abstract
Principal component analysis (PCA) is a popular method for dimension reduction in unsupervised multivariate analysis. However, existing ad hoc uses of PCA in both multivariate regression (multiple outcomes) and multiple regression (multiple predictors) lack theoretical justification. The differences in the statistical properties of PCAs in these two regression settings are not well understood. In this paper we provide theoretical results on the power of PCA in genetic association testing in both multiple phenotype and SNP-set settings. The multiple phenotype setting refers to the case when one is interested in studying the association between a single SNP and multiple phenotypes as outcomes. The SNP-set setting refers to the case when one is interested in studying the association between multiple SNPs in a SNP set and a single phenotype as the outcome. We demonstrate analytically that the properties of the PC-based analysis in these two regression settings are substantially different. We show that the lower-order PCs, that is, PCs with large eigenvalues, are generally preferred and lead to a higher power in the SNP-set setting, while the higher-order PCs, that is, PCs with small eigenvalues, are generally preferred in the multiple phenotype setting. We also investigate the power of three other popular statistical methods, the Wald test, the variance component test and the minimum p-value test, in both multiple phenotype and SNP-set settings. We use theoretical power, simulation studies, and two real data analyses to validate our findings.
Affiliation(s)
- Zhonghua Liu
- Department of Statistics and Actuarial Science, The University of Hong Kong
| | - Ian Barnett
- Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania
| | - Xihong Lin
- Department of Biostatistics and Statistics, Harvard University
|
23
|
Effect of non-normality and low count variants on cross-phenotype association tests in GWAS. Eur J Hum Genet 2019; 28:300-312. [PMID: 31582815 DOI: 10.1038/s41431-019-0514-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 12/03/2018] [Revised: 09/01/2019] [Accepted: 09/05/2019] [Indexed: 01/21/2023] Open
Abstract
Many complex human diseases, such as type 2 diabetes, are characterized by multiple underlying traits/phenotypes that have substantially shared genetic architecture. Multivariate analysis of correlated traits has the potential to increase the power of detecting underlying common genetic loci. Several cross-phenotype association methods have been proposed; some require individual-level data on traits and genotypes, while others require only summary-level data. In this article, we explore whether non-normality of the multivariate trait distribution affects the inference from some of the existing multi-trait methods and how that effect depends on the allele count of the genetic variant being tested. We find that most of these tests are susceptible to biases that lead to spurious association signals. Even after controlling for confounders that may contribute to non-normality and then applying inverse normal transformation on the residuals of each trait, these tests may have inflated type I errors for variants with low minor allele counts (MACs). A likelihood ratio test of association based on the ordinal regression of individual-level genotype conditional on the traits seems to be the least biased and can maintain type I error when the MAC is reasonably large (e.g., MAC > 30). Application of these methods to publicly available summary statistics of eight amino acid traits in European samples seems to show systematic inflation (especially for variants with low MAC), which is consistent with our findings from simulation experiments.
|
24
|
Liu Y, Xie J. Cauchy combination test: a powerful test with analytic p-value calculation under arbitrary dependency structures. J Am Stat Assoc 2019; 115:393-402. [PMID: 33012899 DOI: 10.1080/01621459.2018.1554485] [Citation(s) in RCA: 164] [Impact Index Per Article: 32.8] [Indexed: 10/27/2022]
Abstract
Combining individual p-values to aggregate multiple small effects has long been of interest in statistics, dating back to the classic Fisher's combination test. In modern large-scale data analysis, correlation and sparsity are common features and efficient computation is a necessary requirement for dealing with massive data. To overcome these challenges, we propose a new test that takes advantage of the Cauchy distribution. Our test statistic has a simple form and is defined as a weighted sum of Cauchy transformations of individual p-values. We prove a non-asymptotic result that the tail of the null distribution of our proposed test statistic can be well approximated by a Cauchy distribution under arbitrary dependency structures. Based on this theoretical result, the p-value calculation of our proposed test is not only accurate, but also as simple as the classic z-test or t-test, making our test well suited for analyzing massive data. We further show that the power of the proposed test is asymptotically optimal in a strong sparsity setting. Extensive simulations demonstrate that the proposed test has both strong power against sparse alternatives and good accuracy with respect to p-value calculations, especially for very small p-values. The proposed test has also been applied to a genome-wide association study of Crohn's disease and compared with several existing tests.
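The statistic described in this abstract can be illustrated with a minimal Python sketch. It assumes the commonly stated form of the test, T = Σ w_i · tan((0.5 − p_i)π) with non-negative weights summing to one, and the Cauchy tail approximation p ≈ 0.5 − arctan(T)/π. The function name `cauchy_combination` and the equal default weights are illustrative choices, not taken from the authors' software.

```python
import math


def cauchy_combination(pvalues, weights=None):
    """Combine p-values with a Cauchy-transform statistic (illustrative sketch).

    Assumed form: T = sum_i w_i * tan((0.5 - p_i) * pi), and the combined
    p-value is approximated by the standard Cauchy tail, 0.5 - atan(T) / pi.
    """
    n = len(pvalues)
    if n == 0:
        raise ValueError("at least one p-value is required")
    if weights is None:
        # Equal weights are an assumption made for this sketch.
        weights = [1.0 / n] * n
    if len(weights) != n:
        raise ValueError("weights and pvalues must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    if any(not (0.0 < p < 1.0) for p in pvalues):
        raise ValueError("p-values must lie strictly between 0 and 1")

    # Each p-value is mapped to a standard Cauchy variate under the null.
    statistic = sum(w * math.tan((0.5 - p) * math.pi)
                    for w, p in zip(weights, pvalues))
    # Upper tail probability of a standard Cauchy distribution at `statistic`.
    return 0.5 - math.atan(statistic) / math.pi


if __name__ == "__main__":
    print(cauchy_combination([0.001, 0.20, 0.65, 0.90]))
```

With a single p-value the function returns that p-value unchanged, which is a convenient sanity check. Very small p-values (near machine precision) would need a more careful tail evaluation than this sketch provides.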
Affiliation(s)
- Yaowu Liu
- Department of Biostatistics, Harvard School of Public Health
| | - Jun Xie
- Department of Statistics, Purdue University
|