1
|
Deng Q, Song C, Lin S. An adaptive and robust method for multi-trait analysis of genome-wide association studies using summary statistics. Eur J Hum Genet 2024; 32:681-690. [PMID: 37237036 PMCID: PMC11153499 DOI: 10.1038/s41431-023-01389-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2022] [Revised: 05/01/2023] [Accepted: 05/10/2023] [Indexed: 05/28/2023] Open
Abstract
Genome-wide association studies (GWAS) have identified thousands of genetic variants associated with human traits or diseases in the past decade. Nevertheless, much of the heritability of many traits is still unaccounted for. Commonly used single-trait analysis methods are conservative, while multi-trait methods improve statistical power by integrating association evidence across multiple traits. In contrast to individual-level data, GWAS summary statistics are usually publicly available, and thus methods using only summary statistics have greater usage. Although many methods have been developed for joint analysis of multiple traits using summary statistics, there are many issues, including inconsistent performance, computational inefficiency, and numerical problems when considering lots of traits. To address these challenges, we propose a multi-trait adaptive Fisher method for summary statistics (MTAFS), a computationally efficient method with robust power performance. We applied MTAFS to two sets of brain imaging derived phenotypes (IDPs) from the UK Biobank, including a set of 58 Volumetric IDPs and a set of 212 Area IDPs. Through annotation analysis, the underlying genes of the SNPs identified by MTAFS were found to exhibit higher expression and are significantly enriched in brain-related tissues. Together with results from a simulation study, MTAFS shows its advantage over existing multi-trait methods, with robust performance across a range of underlying settings. It controls type 1 error well and can efficiently handle a large number of traits.
Collapse
Affiliation(s)
- Qiaolan Deng
- Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, USA
- Department of Statistics, College of Arts and Sciences, The Ohio State University, Columbus, OH, USA
| | - Chi Song
- Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, USA
| | - Shili Lin
- Department of Statistics, College of Arts and Sciences, The Ohio State University, Columbus, OH, USA.
| |
Collapse
|
2
|
Li Y, Fang Y, Chang HC, Gorczyca M, Liu P, Tseng GC. Adaptively Integrative Association between Multivariate Phenotypes and Transcriptomic Data for Complex Diseases. Genes (Basel) 2023; 14:genes14040798. [PMID: 37107556 PMCID: PMC10138055 DOI: 10.3390/genes14040798] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2023] [Revised: 03/08/2023] [Accepted: 03/14/2023] [Indexed: 03/29/2023] Open
Abstract
Phenotype–gene association studies can uncover disease mechanisms for translational research. Association with multiple phenotypes or clinical variables in complex diseases has the advantage of increasing statistical power and offering a holistic view. Existing multi-variate association methods mostly focus on SNP-based genetic associations. In this paper, we extend and evaluate two adaptive Fisher’s methods, namely AFp and AFz, from the p-value combination perspective for phenotype–mRNA association analysis. The proposed method effectively aggregates heterogeneous phenotype–gene effects, allows association with different data types of phenotypes, and performs the selection of the associated phenotypes. Variability indices of the phenotype–gene effect selection are calculated by bootstrap analysis, and the resulting co-membership matrix identifies gene modules clustered by phenotype–gene effect. Extensive simulations demonstrate the superior performance of AFp compared to existing methods in terms of type I error control, statistical power and biological interpretation. Finally, the method is separately applied to three sets of transcriptomic and clinical datasets from lung disease, breast cancer, and brain aging and generates intriguing biological findings.
Collapse
Affiliation(s)
- Yujia Li
- Eli Lilly and Company, Indianapolis, IN 46225, USA
| | - Yusi Fang
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA
| | - Hung-Ching Chang
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA
| | - Michael Gorczyca
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA
| | - Peng Liu
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA
| | - George C. Tseng
- Department of Biostatistics, University of Pittsburgh, Pittsburgh, PA 15261, USA
- Correspondence:
| |
Collapse
|
3
|
Luo X, Cai G, Mclain AC, Amos CI, Cai B, Xiao F. BMI-CNV: a Bayesian framework for multiple genotyping platforms detection of copy number variants. Genetics 2022; 222:iyac147. [PMID: 36171678 PMCID: PMC9713397 DOI: 10.1093/genetics/iyac147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2022] [Accepted: 09/08/2022] [Indexed: 12/13/2022] Open
Abstract
Whole-exome sequencing (WES) enables the detection of copy number variants (CNVs) with high resolution in protein-coding regions. However, variants in the intergenic or intragenic regions are excluded from studies. Fortunately, many of these samples have been previously sequenced by other genotyping platforms which are sparse but cover a wide range of genomic regions, such as SNP array. Moreover, conventional single sample-based methods suffer from a high false discovery rate due to prominent data noise. Therefore, methods for integrating multiple genotyping platforms and multiple samples are highly demanded for improved copy number variant detection. We developed BMI-CNV, a Bayesian Multisample and Integrative CNV (BMI-CNV) profiling method with data sequenced by both whole-exome sequencing and microarray. For the multisample integration, we identify the shared copy number variants regions across samples using a Bayesian probit stick-breaking process model coupled with a Gaussian Mixture model estimation. With extensive simulations, BMI-copy number variant outperformed existing methods with improved accuracy. In the matched data from the 1000 Genomes Project and HapMap project data, BMI-CNV also accurately detected common variants and significantly enlarged the detection spectrum of whole-exome sequencing. Further application to the data from The Research of International Cancer of Lung consortium (TRICL) identified lung cancer risk variant candidates in 17q11.2, 1p36.12, 8q23.1, and 5q22.2 regions.
Collapse
Affiliation(s)
- Xizhi Luo
- Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
| | - Guoshuai Cai
- Department of Environmental Health Sciences, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
| | - Alexander C Mclain
- Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
| | - Christopher I Amos
- Department of Quantitative Sciences, Baylor College of Medicine, Houston, TX 77030, USA
| | - Bo Cai
- Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina, Columbia, SC 29208, USA
| | - Feifei Xiao
- Department of Biostatistics, University of Florida, Gainesville, FL 32603, USA
| |
Collapse
|
4
|
Chen Q, Lin S, Song C. An Adaptive and Robust Test for Microbial Community Analysis. Front Genet 2022; 13:846258. [PMID: 35664318 PMCID: PMC9162041 DOI: 10.3389/fgene.2022.846258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2021] [Accepted: 03/28/2022] [Indexed: 11/21/2022] Open
Abstract
In microbiome studies, researchers measure the abundance of each operational taxon unit (OTU) and are often interested in testing the association between the microbiota and the clinical outcome while conditional on certain covariates. Two types of approaches exists for this testing purpose: the OTU-level tests that assess the association between each OTU and the outcome, and the community-level tests that examine the microbial community all together. It is of considerable interest to develop methods that enjoy both the flexibility of OTU-level tests and the biological relevance of community-level tests. We proposed MiAF, a method that adaptively combines p-values from the OTU-level tests to construct a community-level test. By borrowing the flexibility of OTU-level tests, the proposed method has great potential to generate a series of community-level tests that suit a range of different microbiome profiles, while achieving the desirable high statistical power of community-level testing methods. Using simulation study and real data applications in a smoker throat microbiome study and a HIV patient stool microbiome study, we demonstrated that MiAF has comparable or better power than methods that are specifically designed for community-level tests. The proposed method also provides a natural heuristic taxa selection.
Collapse
Affiliation(s)
- Qingyu Chen
- Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, United States
| | - Shili Lin
- Department of Statistics, College of Arts and Sciences, The Ohio State University, Columbus, OH, United States
- *Correspondence: Shili Lin, ; Chi Song,
| | - Chi Song
- Division of Biostatistics, College of Public Health, The Ohio State University, Columbus, OH, United States
- *Correspondence: Shili Lin, ; Chi Song,
| |
Collapse
|
5
|
Heard NA. Standardized Partial Sums and Products of p-Values. J Comput Graph Stat 2021. [DOI: 10.1080/10618600.2021.1999822] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
Affiliation(s)
- N. A. Heard
- Department of Mathematics, Imperial College London, London, UK
| |
Collapse
|
6
|
Qin F, Luo X, Cai G, Xiao F. Shall genomic correlation structure be considered in copy number variants detection? Brief Bioinform 2021; 22:6295811. [PMID: 34114005 DOI: 10.1093/bib/bbab215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2021] [Revised: 04/16/2021] [Accepted: 05/17/2021] [Indexed: 11/14/2022] Open
Abstract
Copy number variation has been identified as a major source of genomic variation associated with disease susceptibility. With the advent of whole-exome sequencing (WES) technology, massive WES data have been generated, allowing for the identification of copy number variants (CNVs) in the protein-coding regions with direct functional interpretation. We have previously shown evidence of the genomic correlation structure in array data and developed a novel chromosomal breakpoint detection algorithm, LDcnv, which showed significantly improved detection power through integrating the correlation structure in a systematic modeling manner. However, it remains unexplored whether the genomic correlation exists in WES data and how such correlation structure integration can improve the CNV detection accuracy. In this study, we first explored the correlation structure of the WES data using the 1000 Genomes Project data. Both real raw read depth and median-normalized data showed strong evidence of the correlation structure. Motivated by this fact, we proposed a correlation-based method, CORRseq, as a novel release of the LDcnv algorithm in profiling WES data. The performance of CORRseq was evaluated in extensive simulation studies and real data analysis from the 1000 Genomes Project. CORRseq outperformed the existing methods in detecting medium and large CNVs. In conclusion, it would be more advantageous to model genomic correlation structure in detecting relatively long CNVs. This study provides great insights for methodology development of CNV detection with NGS data.
Collapse
Affiliation(s)
- Fei Qin
- Department of Epidemiology and Biostatistics, Arnold School of Public Health, University of South Carolina (USC), Discovery 449, 915 Greene St, Columbia, SC 29208, USA
| | - Xizhi Luo
- Department of Epidemiology and Biostatistics, Arnold School of Public Health, USC, Discovery 449, 915 Greene St, Columbia, SC 29208, USA
| | - Guoshuai Cai
- Department of Environmental Health Science, Arnold School of Public Health, USC, Discovery 449, 915 Greene St, Columbia, SC 29208, USA
| | - Feifei Xiao
- Department of Epidemiology and Biostatistics, Arnold School of Public Health, USC, Discovery 449, 915 Greene St, Columbia, SC 29208, USA
| |
Collapse
|
7
|
Cai X, Chang LB, Potter J, Song C. Adaptive Fisher method detects dense and sparse signals in association analysis of SNV sets. BMC Med Genomics 2020; 13:46. [PMID: 32241265 PMCID: PMC7118831 DOI: 10.1186/s12920-020-0684-3] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
BACKGROUND With the development of next generation sequencing (NGS) technology and genotype imputation methods, statistical methods have been proposed to test a set of genomic variants together to detect if any of them is associated with the phenotype or disease. In practice, within the set, there is an unknown proportion of variants truly causal or associated with the disease. There is a demand for statistical methods with high power in both dense and sparse scenarios, where the proportion of causal or associated variants is large or small respectively. RESULTS We propose a new association test - weighted Adaptive Fisher (wAF) that can adapt to both dense and sparse scenarios by adding weights to the Adaptive Fisher (AF) method we developed before. Using simulation, we show that wAF enjoys comparable or better power to popular methods such as sequence kernel association tests (SKAT and SKAT-O) and adaptive SPU (aSPU) test. We apply wAF to a publicly available schizophrenia dataset, and successfully detect thirteen genes. Among them, three genes are supported by existing literature; six are plausible as they either relate to other neurological diseases or have relevant biological functions. CONCLUSIONS The proposed wAF method is a powerful disease-variants association test in both dense and sparse scenarios. Both simulation studies and real data analysis indicate the potential of wAF for new biological findings.
Collapse
Affiliation(s)
- Xiaoyu Cai
- Department of Statistics, The Ohio State University, 1948 Neil Ave., Columbus, OH 43210, US
| | - Lo-Bin Chang
- Department of Statistics, The Ohio State University, 1948 Neil Ave., Columbus, OH 43210, US
| | - Jordan Potter
- Department of Mathematics and Statistics, Kenyon College, 201 N College Rd., Gambier, Ohio 43022, US
| | - Chi Song
- College of Public Health, Division of Biostatistics, The Ohio State University, 1841 Neil Ave., 208E Cunz Hall, Columbus, OH 43210, US
| |
Collapse
|
8
|
Song C, Min X, Zhang H. THE SCREENING AND RANKING ALGORITHM FOR CHANGE-POINTS DETECTION IN MULTIPLE SAMPLES. Ann Appl Stat 2017; 10:2102-2129. [PMID: 28090239 DOI: 10.1214/16-aoas966] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Abstract
The chromosome copy number variation (CNV) is the deviation of genomic regions from their normal copy number states, which may associate with many human diseases. Current genetic studies usually collect hundreds to thousands of samples to study the association between CNV and diseases. CNVs can be called by detecting the change-points in mean for sequences of array-based intensity measurements. Although multiple samples are of interest, the majority of the available CNV calling methods are single sample based. Only a few multiple sample methods have been proposed using scan statistics that are computationally intensive and designed toward either common or rare change-points detection. In this paper, we propose a novel multiple sample method by adaptively combining the scan statistic of the screening and ranking algorithm (SaRa), which is computationally efficient and is able to detect both common and rare change-points. We prove that asymptotically this method can find the true change-points with almost certainty and show in theory that multiple sample methods are superior to single sample methods when shared change-points are of interest. Additionally, we report extensive simulation studies to examine the performance of our proposed method. Finally, using our proposed method as well as two competing approaches, we attempt to detect CNVs in the data from the Primary Open-Angle Glaucoma Genes and Environment study, and conclude that our method is faster and requires less information while our ability to detect the CNVs is comparable or better.
Collapse
|