1
Kim K, Jun TH, Ha BK, Wang S, Sun H. New statistical selection method for pleiotropic variants associated with both quantitative and qualitative traits. BMC Bioinformatics 2023; 24:381. [PMID: 37817069 PMCID: PMC10563219 DOI: 10.1186/s12859-023-05505-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Accepted: 09/28/2023] [Indexed: 10/12/2023]
Abstract
BACKGROUND Identification of pleiotropic variants associated with multiple phenotypic traits has received increasing attention in genetic association studies. Overlapping genetic associations from multiple traits help to detect weak genetic associations missed by single-trait analyses. Many statistical methods have been developed to identify pleiotropic variants, but most of them are limited to quantitative traits, even though pleiotropic effects on both quantitative and qualitative traits have been observed. This is a statistically challenging problem because no appropriate multivariate distribution exists to model quantitative and qualitative data together. Alternatively, meta-analysis methods can be applied; these integrate summary statistics of individual variants associated with either a quantitative or a qualitative trait, without accounting for correlations among genetic variants. RESULTS We propose a new statistical selection method based on a unified selection score quantifying how a genetic variant, i.e., a pleiotropic variant, associates with both quantitative and qualitative traits. In our extensive simulation studies, where various types of pleiotropic effects on both quantitative and qualitative traits were considered, we demonstrated that the proposed method outperforms the existing meta-analysis methods in terms of true positive selection. We also applied the proposed method to a peanut dataset with 6 quantitative and 2 qualitative traits, and a cowpea dataset with 2 quantitative and 6 qualitative traits. We were able to detect some potentially pleiotropic variants missed by the existing methods in both analyses. CONCLUSIONS The proposed method is able to locate pleiotropic variants associated with both quantitative and qualitative traits. It has been implemented in an R package, 'UNISS', which can be downloaded from http://github.com/statpng/uniss.
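The abstract does not say which meta-analysis methods served as baselines. As an illustration of what "integrating summary statistics" can look like, the sketch below uses Fisher's combined probability test, a classical way to merge one p-value from a quantitative-trait test and one from a qualitative-trait test for the same variant. It is not the UNISS selection score; the function names and the example p-values are hypothetical, and the test assumes the input p-values are independent.

```python
import math

def chi2_sf_even_df(x, df):
    """Survival function P(X > x) of a chi-square variable with even df.

    For df = 2k the tail has the closed form
    exp(-x/2) * sum_{i<k} (x/2)^i / i!, so no external library is needed.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2.0
    term, total = 1.0, 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total

def fisher_combine(p_values):
    """Fisher's method: -2 * sum(log p) follows chi-square with 2m df."""
    stat = -2.0 * sum(math.log(p) for p in p_values)
    return chi2_sf_even_df(stat, 2 * len(p_values))

# Hypothetical per-trait p-values for a single variant.
p_quant = 0.05   # from a test against a quantitative trait
p_qual = 0.05    # from a test against a qualitative trait
p_combined = fisher_combine([p_quant, p_qual])
print(p_combined)
```

Two borderline p-values combine into a smaller one, which is the sense in which overlapping associations can surface weak signals. Each variant is still treated on its own here, which is the limitation the abstract points out.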
Affiliation(s)
- Kipoong Kim
  - Department of Statistics, Pusan National University, 46241, Busan, Korea
- Tae-Hwan Jun
  - Department of Plant Bioscience, Pusan National University, 50463, Miryang, Korea
- Bo-Keun Ha
  - Department of Applied Plant Science, Chonnam National University, 61186, Gwangju, Korea
- Shuang Wang
  - Department of Biostatistics, Mailman School of Public Health, Columbia University, New York, 10032, USA
- Hokeun Sun
  - Department of Statistics, Pusan National University, 46241, Busan, Korea
2
Caballero FF, Lana A, Struijk EA, Arias-Fernández L, Yévenes-Briones H, Cárdenas-Valladolid J, Salinero-Fort MÁ, Banegas JR, Rodríguez-Artalejo F, Lopez-Garcia E. Prospective Association Between Plasma Concentrations of Fatty Acids and Other Lipids, and Multimorbidity in Older Adults. J Gerontol A Biol Sci Med Sci 2023; 78:1763-1770. [PMID: 37156635 DOI: 10.1093/gerona/glad122] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2022] [Indexed: 05/10/2023]
Abstract
Biological mechanisms that lead to multimorbidity are mostly unknown, and metabolomic profiles are promising to explain different pathways in the aging process. The aim of this study was to assess the prospective association between plasma fatty acids and other lipids, and multimorbidity in older adults. Data were obtained from the Spanish Seniors-ENRICA 2 cohort, comprising noninstitutionalized adults ≥65 years old. Blood samples were obtained at baseline and after a 2-year follow-up period for a total of 1 488 subjects. Morbidity was also collected at baseline and end of the follow-up from electronic health records. Multimorbidity was defined as a quantitative score, after weighting morbidities (from a list of 60 mutually exclusive chronic conditions) by their regression coefficients on physical functioning. Generalized estimating equation models were employed to assess the longitudinal association between fatty acids and other lipids, and multimorbidity, and stratified analyses by diet quality, measured with the Alternative Healthy Eating Index-2010, were also conducted. Among study participants, higher concentrations of omega-6 fatty acids [coef. per 1-SD increase (95% CI) = -0.76 (-1.23, -0.30)], phosphoglycerides [-1.26 (-1.77, -0.74)], total cholines [-1.48 (-1.99, -0.96)], phosphatidylcholines [-1.23 (-1.74, -0.71)], and sphingomyelins [-1.65 (-2.12, -1.18)], were associated with lower multimorbidity scores. The strongest associations were observed for those with a higher diet quality. Higher plasma concentrations of omega-6 fatty acids, phosphoglycerides, total cholines, phosphatidylcholines, and sphingomyelins were prospectively associated with lower multimorbidity in older adults, although diet quality could modulate the associations found. These lipids may serve as risk markers for multimorbidity.
Affiliation(s)
- Francisco Félix Caballero
  - Department of Preventive Medicine and Public Health, Universidad Autónoma de Madrid and CIBER of Epidemiology and Public Health, Madrid, Spain
- Alberto Lana
  - Department of Medicine, Universidad de Oviedo/ISPA, Oviedo, Spain
- Ellen A Struijk
  - Department of Preventive Medicine and Public Health, Universidad Autónoma de Madrid and CIBER of Epidemiology and Public Health, Madrid, Spain
- Humberto Yévenes-Briones
  - Department of Preventive Medicine and Public Health, Universidad Autónoma de Madrid and CIBER of Epidemiology and Public Health, Madrid, Spain
- Juan Cárdenas-Valladolid
  - Dirección Técnica de Sistemas de Información. Gerencia Asistencial de Atención Primaria, Servicio Madrileño de Salud, Fundación de Investigación e Innovación Biosanitaria de Atención Primaria, Madrid, Spain
  - Enfermería, Universidad Alfonso X El Sabio, Villanueva de la Cañada, Spain
- Miguel Ángel Salinero-Fort
  - Subdirección General de Investigación Sanitaria, Consejería de Sanidad, Fundación de Investigación e Innovación Sanitaria de Atención Primaria, Madrid, Spain
  - Red de Investigación en Servicios de Salud en Enfermedades Crónicas, Grupo de Envejecimiento y Fragilidad de las personas mayores. IdIPAZ, Madrid, Spain
- José R Banegas
  - Department of Preventive Medicine and Public Health, Universidad Autónoma de Madrid and CIBER of Epidemiology and Public Health, Madrid, Spain
- Fernando Rodríguez-Artalejo
  - Department of Preventive Medicine and Public Health, Universidad Autónoma de Madrid and CIBER of Epidemiology and Public Health, Madrid, Spain
  - IMDEA-Food Institute. CEI UAM+CSIC, Madrid, Spain
- Esther Lopez-Garcia
  - Department of Preventive Medicine and Public Health, Universidad Autónoma de Madrid and CIBER of Epidemiology and Public Health, Madrid, Spain
  - IMDEA-Food Institute. CEI UAM+CSIC, Madrid, Spain
3
Liang X, Sun H. Weighted Selection Probability to Prioritize Susceptible Rare Variants in Multi-Phenotype Association Studies with Application to a Soybean Genetic Data Set. J Comput Biol 2023; 30:1075-1088. [PMID: 37871292 DOI: 10.1089/cmb.2022.0487] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2023]
Abstract
Rare variant association studies with multiple traits or diseases have drawn a lot of attention since association signals of rare variants can be boosted if more than one phenotype outcome is associated with the same rare variants. Most of the existing statistical methods to identify rare variants associated with multiple phenotypes are based on a group test, where a pre-specified genetic region is tested one at a time. However, these methods are not designed to locate susceptible rare variants within the genetic region. In this article, we propose new statistical methods to prioritize rare variants within a genetic region when a group test for the genetic region identifies a statistical association with multiple phenotypes. It computes the weighted selection probability (WSP) of individual rare variants and ranks them from largest to smallest according to their WSP. In simulation studies, we demonstrated that the proposed method outperforms other statistical methods in terms of true positive selection, when multiple phenotypes are correlated with each other. We also applied it to our soybean single nucleotide polymorphism (SNP) data with 13 highly correlated amino acids, where we identified some potentially susceptible rare variants in chromosome 19.
Affiliation(s)
- Xianglong Liang
  - Department of Statistics, Pusan National University, Busan, Korea
- Hokeun Sun
  - Department of Statistics, Pusan National University, Busan, Korea
4
Boutry S, Helaers R, Lenaerts T, Vikkula M. Rare variant association on unrelated individuals in case-control studies using aggregation tests: existing methods and current limitations. Brief Bioinform 2023; 24:bbad412. [PMID: 37974506 DOI: 10.1093/bib/bbad412] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2023] [Revised: 10/14/2023] [Accepted: 10/28/2023] [Indexed: 11/19/2023]
Abstract
Over the past years, progress in next-generation sequencing technologies and bioinformatics has sparked a surge in association studies. In particular, genome-wide association studies (GWASs) have demonstrated their effectiveness in identifying disease associations with common genetic variants. Yet, rare variants can contribute to additional disease risk or trait heterogeneity. Because GWASs are underpowered for detecting association with such variants, numerous statistical methods have been recently proposed. Aggregation tests collapse multiple rare variants within a genetic region (e.g. gene, gene set, genomic loci) to test for association. An increasing number of studies using such methods have successfully identified trait-associated rare variants and led to a better understanding of the underlying disease mechanism. In this review, we compare existing aggregation tests, their statistical features and scope of application, splitting them into the five classical classes: burden, adaptive burden, variance-component, omnibus and other. Finally, we describe some limitations of current aggregation tests, highlighting potential directions for further investigation.
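To make "collapsing rare variants within a region" concrete, here is a minimal sketch of the simplest member of the burden class: an unweighted count of minor alleles per individual, compared between cases and controls with a permutation test. The toy genotype matrix, the function names and the choice of a permutation p-value are illustrative assumptions, not taken from the review; published burden tests typically add variant weights and covariate adjustment.

```python
import random

def burden_scores(genotypes):
    """Collapse each individual's rare-variant genotypes into one count."""
    return [sum(row) for row in genotypes]

def mean(values):
    return sum(values) / len(values)

def burden_permutation_test(genotypes, is_case, n_perm=2000, seed=0):
    """Return (|mean case burden - mean control burden|, permutation p-value)."""
    rng = random.Random(seed)
    scores = burden_scores(genotypes)

    def stat(labels):
        cases = [s for s, c in zip(scores, labels) if c]
        controls = [s for s, c in zip(scores, labels) if not c]
        return abs(mean(cases) - mean(controls))

    observed = stat(is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)  # break any link between burden and status
        if stat(labels) >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Toy region with 4 rare variants: rows are individuals, entries are
# minor-allele indicators. First 6 rows are cases, last 6 are controls.
genotypes = [
    [1, 0, 1, 0], [0, 1, 1, 0], [1, 1, 0, 0],
    [0, 0, 1, 1], [1, 0, 0, 1], [0, 1, 0, 1],
    [0, 0, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0],
    [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0],
]
is_case = [True] * 6 + [False] * 6
observed, p_value = burden_permutation_test(genotypes, is_case)
print(observed, p_value)
```

No single variant in the toy data separates cases from controls, but the collapsed count does, which is the motivation for aggregation. A plain sum also assumes all variants act in the same direction, one reason the other classes listed in the abstract exist.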
Affiliation(s)
- Simon Boutry
  - Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, 1050 Brussels, Belgium
- Raphaël Helaers
  - Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
- Tom Lenaerts
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussels, 1050 Brussels, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, 1050 Brussels, Belgium
  - Artificial Intelligence laboratory, Vrije Universiteit Brussel, 1050 Brussels, Belgium
- Miikka Vikkula
  - Human Molecular Genetics, de Duve Institute, University of Louvain, Avenue Hippocrate 74 (+5) bte B1.74.06, 1200 Brussels, Belgium
  - WELBIO department, WEL Research Institute, avenue Pasteur, 6, 1300 Wavre, Belgium
5
Wu M, Hao S, Wang X, Su S, Du S, Zhou S, Yang R, Du H. A pyroptosis-related gene signature that predicts immune infiltration and prognosis in colon cancer. Front Oncol 2023; 13:1173181. [PMID: 37503314 PMCID: PMC10369052 DOI: 10.3389/fonc.2023.1173181] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Accepted: 06/23/2023] [Indexed: 07/29/2023]
Abstract
Background Colon cancer (CC) is a highly heterogeneous malignancy associated with high morbidity and mortality. Pyroptosis is a type of programmed cell death characterized by an inflammatory response that can affect the tumor immune microenvironment and has potential prognostic and therapeutic value. The aim of this study was to evaluate the association between pyroptosis-related gene (PRG) expression and CC. Methods Based on the expression profiles of PRGs, we classified CC samples from The Cancer Genome Atlas and Gene Expression Omnibus databases into different clusters by unsupervised clustering analysis. The best prognostic signature was screened and established using least absolute shrinkage and selection operator (LASSO) and multivariate Cox regression analyses. Subsequently, a nomogram was established based on multivariate Cox regression analysis. Next, gene set enrichment analysis (GSEA) and gene set variation analysis (GSVA) were performed to explore the potential molecular mechanisms between the high- and low-risk groups and to explore the differences in clinicopathological characteristics, gene mutation characteristics, abundance of infiltrating immune cells, and immune microenvironment between the two groups. We also evaluated the association between common immune checkpoints and drug sensitivity using risk scores. Immunohistochemistry staining was used to confirm the expression in CC of the genes selected for the prognostic model. Results The 1163 CC samples were divided into two clusters (clusters A and B) based on the expression profiles of the 33 PRGs. Genes with prognostic value were screened from the differentially expressed genes (DEGs) between the two clusters, and an eight-PRG prognostic model was constructed. GSEA and GSVA of the high- and low-risk groups revealed that they were mainly enriched in inflammatory response-related pathways. Compared to those in the low-risk group, patients in the high-risk group had worse overall survival, an immunosuppressive microenvironment, and worse sensitivity to immunotherapy and drug treatment. Conclusion Our findings provide a foundation for future research targeting pyroptosis and new insights into prognosis and immunotherapy from the perspective of pyroptosis in CC.
Affiliation(s)
- Mingjian Wu
  - Department of Gastrointestinal Surgery, Panyu Maternal and Child Care Service Centre of Guangzhou (He Xian Memorial Affiliated Hospital of Southern Medical University), Guangzhou, China
- Shuai Hao
  - Department of General Surgery, Jinling Hospital, Medical School of Nanjing University, Nanjing, China
- Xiaoxiang Wang
  - The First Clinical Medical College, Guangdong Medical University, Zhanjiang, Guangdong, China
- Shuguang Su
  - Department of Pathology, Panyu Maternal and Child Care Service Centre of Guangzhou (He Xian Memorial Affiliated Hospital of Southern Medical University), Guangzhou, China
- Siyuan Du
  - Department of Pathology, Panyu Maternal and Child Care Service Centre of Guangzhou (He Xian Memorial Affiliated Hospital of Southern Medical University), Guangzhou, China
- Sitong Zhou
  - Department of Dermatology, The First People’s Hospital of Foshan, Foshan, Guangdong, China
- Ronghua Yang
  - Department of Burn and Plastic Surgery, Guangzhou First People’s Hospital, South China University of Technology, Guangzhou, Guangdong, China
- Hanpeng Du
  - Department of Gastrointestinal Surgery, Panyu Maternal and Child Care Service Centre of Guangzhou (He Xian Memorial Affiliated Hospital of Southern Medical University), Guangzhou, China
6
Chu BB, Ko S, Zhou JJ, Jensen A, Zhou H, Sinsheimer JS, Lange K. Multivariate genome-wide association analysis by iterative hard thresholding. Bioinformatics 2023; 39:btad193. [PMID: 37067496 PMCID: PMC10133532 DOI: 10.1093/bioinformatics/btad193] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2022] [Revised: 04/07/2023] [Accepted: 04/13/2023] [Indexed: 04/18/2023]
Abstract
MOTIVATION In a genome-wide association study, analyzing multiple correlated traits simultaneously is potentially superior to analyzing the traits one by one. Standard methods for multivariate genome-wide association study operate marker-by-marker and are computationally intensive. RESULTS We present a sparsity constrained regression algorithm for multivariate genome-wide association study based on iterative hard thresholding and implement it in a convenient Julia package MendelIHT.jl. In simulation studies with up to 100 quantitative traits, iterative hard thresholding exhibits similar true positive rates, smaller false positive rates, and faster execution times than GEMMA's linear mixed models and mv-PLINK's canonical correlation analysis. On UK Biobank data with 470 228 variants, MendelIHT completed a three-trait joint analysis (n=185 656) in 20 h and an 18-trait joint analysis (n=104 264) in 53 h with an 80 GB memory footprint. In short, MendelIHT enables geneticists to fit a single regression model that simultaneously considers the effect of all SNPs and dozens of traits. AVAILABILITY AND IMPLEMENTATION Software, documentation, and scripts to reproduce our results are available from https://github.com/OpenMendel/MendelIHT.jl.
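The core idea of iterative hard thresholding can be shown in a few lines: take a gradient step on the least-squares loss, then keep only the k largest coefficients and zero the rest. The sketch below is a single-trait, pure-Python toy with a fixed step size and simulated data. It is not the MendelIHT.jl API, which is a Julia package and handles multiple traits; every name and parameter value here is a hypothetical choice for illustration.

```python
import random

def hard_threshold(beta, k):
    """Keep the k largest-magnitude entries of beta and zero the others."""
    order = sorted(range(len(beta)), key=lambda j: abs(beta[j]), reverse=True)
    keep = set(order[:k])
    return [b if j in keep else 0.0 for j, b in enumerate(beta)]

def iht(X, y, k, step, n_iter=200):
    """Sparsity-constrained least squares: gradient step, then projection."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        # Negative gradient of the half mean squared error: X^T r / n.
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        beta = hard_threshold([beta[j] + step * grad[j] for j in range(p)], k)
    return beta

# Simulated data: 200 samples, 20 standardized predictors, 2 true signals.
rng = random.Random(0)
n, p = 200, 20
true_beta = [0.0] * p
true_beta[2], true_beta[7] = 2.0, -1.5
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [sum(x * b for x, b in zip(row, true_beta)) + rng.gauss(0.0, 0.1) for row in X]

beta_hat = iht(X, y, k=2, step=0.5)
support = [j for j, b in enumerate(beta_hat) if b != 0.0]
print(support, [round(beta_hat[j], 2) for j in support])
```

Because all predictors enter one regression, each selected coefficient is estimated jointly with the others rather than marker by marker. In practice the sparsity level k has to be chosen, for example by cross-validation; here it is fixed to the known number of signals.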
Affiliation(s)
- Benjamin B Chu
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
- Seyoon Ko
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
  - Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
- Jin J Zhou
  - Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
  - Department of Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
- Aubrey Jensen
  - Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
- Hua Zhou
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
  - Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
- Janet S Sinsheimer
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
  - Department of Biostatistics, Fielding School of Public Health at UCLA, Los Angeles, CA 90095-1554, United States
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
- Kenneth Lange
  - Department of Computational Medicine, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
  - Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, CA 90095-1554, United States
  - Department of Statistics at UCLA, Los Angeles, CA 90095-1554, United States
7
Survival Analysis with High-Dimensional Omics Data Using a Threshold Gradient Descent Regularization-Based Neural Network Approach. Genes (Basel) 2022; 13:genes13091674. [PMID: 36140842 PMCID: PMC9498566 DOI: 10.3390/genes13091674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/18/2022] [Revised: 09/13/2022] [Accepted: 09/16/2022] [Indexed: 11/17/2022]
Abstract
Analysis of data with a censored survival response and high-dimensional omics measurements is now common. Most of the existing analyses are based on specific (semi)parametric models, in particular the Cox model. Such analyses may be limited by not having sufficient flexibility, for example, in accommodating nonlinearity. For categorical and continuous responses, neural networks (NNs) have provided a highly competitive alternative. Comparatively, NNs for censored survival data remain limited. Omics measurements are usually high-dimensional, and only a small subset is expected to be survival-associated. As such, regularized estimation and selection are needed. In the existing NN studies, this is usually achieved via penalization. In this article, we propose adopting the threshold gradient descent regularization (TGDR) technique, which has competitive performance (for example, when compared to penalization) and unique advantages in regression analysis, but has not been adopted with NNs. The TGDR-based NN has a highly sensible formulation and an architecture different from the unregularized and penalization-based ones. Simulations show its satisfactory performance. Its practical effectiveness is further established via the analysis of two cancer omics datasets. Overall, this study can provide a practical and useful new way in the NN paradigm for survival analysis with high-dimensional omics measurements.
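The abstract does not spell out the update rule, so the sketch below shows threshold gradient descent regularization in its generic form, applied to plain linear least squares rather than to a neural network with a censored survival loss. At each iteration only the coefficients whose gradient magnitude is within a factor tau of the largest one are moved, which keeps most coefficients at zero without an explicit penalty. The data, the names and the values of tau and step are assumptions made for illustration.

```python
import random

def tgdr(X, y, tau=0.9, step=0.1, n_iter=300):
    """Threshold gradient descent for least squares (illustrative sketch).

    tau = 0 reduces to ordinary gradient descent; tau = 1 moves only the
    coordinate(s) with the largest gradient magnitude.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        # Negative gradient of the half mean squared error: X^T r / n.
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n for j in range(p)]
        gmax = max(abs(g) for g in grad)
        if gmax == 0.0:
            break
        # Update only coordinates whose gradient passes the threshold.
        beta = [b + step * g if abs(g) >= tau * gmax else b
                for b, g in zip(beta, grad)]
    return beta

# Simulated data: 200 samples, 20 standardized features, 2 true signals.
rng = random.Random(1)
n, p = 200, 20
true_beta = [0.0] * p
true_beta[2], true_beta[7] = 2.0, -1.5
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [sum(x * b for x, b in zip(row, true_beta)) + rng.gauss(0.0, 0.1) for row in X]

beta_tgdr = tgdr(X, y)
print([round(b, 2) for b in beta_tgdr])
```

The number of iterations acts as a second tuning parameter alongside tau: stopping early leaves more coefficients exactly at zero. Which network weights the thresholding is applied to in the neural-network setting is a design decision of the paper that this sketch does not attempt to reproduce.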
8
Caballero FF, Lana A, Struijk EA, Arias-Fernández L, Yévenes-Briones H, Cárdenas-Valladolid J, Salinero-Fort MÁ, Banegas JR, Rodríguez-Artalejo F, Lopez-Garcia E. Prospective Association Between Plasma Amino Acids And Multimorbidity In Older Adults. J Gerontol A Biol Sci Med Sci 2022; 78:637-644. [PMID: 35876753 DOI: 10.1093/gerona/glac144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2022] [Indexed: 11/14/2022]
Abstract
BACKGROUND Some amino acids have been associated with aging-related disorders and risk of physical impairment. The aim of this study was to assess the association between plasma concentrations of nine amino acids, including branched-chain and aromatic amino acids, and multimorbidity. METHODS This research uses longitudinal data from the Seniors-ENRICA 2 study, a population-based cohort from Spain which comprises non-institutionalized adults older than 65. Blood samples were extracted at baseline and after a follow-up period of two years for a total of 1488 subjects. Participants' information was linked with electronic health records. Chronic diseases were grouped into a list of 60 mutually exclusive conditions. A quantitative measure of multimorbidity, weighting morbidities by their regression coefficients on physical functioning, was employed and ranged from 0 to 100. Generalized estimating equation models were used to explore the relationship between plasma amino acids and multimorbidity, adjusting for sociodemographics, socioeconomic status and lifestyle behaviors. RESULTS The mean age of participants at baseline was 73.6 (SD = 4.2) years; 49.6% were women. Higher concentrations of glutamine [coef. per mmol/l (95% confidence interval) = 10.1 (3.7, 16.6)], isoleucine [50.3 (21.7, 78.9)] and valine [15.5 (3.1, 28.0)] were significantly associated with higher multimorbidity scores, after adjusting for potential confounders. Body mass index could have influenced the relationship between isoleucine and multimorbidity (p = 0.016). CONCLUSIONS Amino acids could play a role in regulating aging-related diseases. Glutamine and branched-chain amino acids such as isoleucine and valine are prospectively associated with multimorbidity in older adults and could serve as risk markers for it.
Affiliation(s)
- Francisco Félix Caballero
  - Department of Preventive Medicine and Public Health. Universidad Autónoma de Madrid and CIBER of Epidemiology and Public Health, Madrid
- Alberto Lana
  - Department of Medicine. Universidad de Oviedo/ISPA, Oviedo
- Ellen A Struijk
  - Department of Preventive Medicine and Public Health. Universidad Autónoma de Madrid and CIBER of Epidemiology and Public Health, Madrid
- Humberto Yévenes-Briones
  - Department of Preventive Medicine and Public Health. Universidad Autónoma de Madrid and CIBER of Epidemiology and Public Health, Madrid
- Juan Cárdenas-Valladolid
  - Dirección Técnica de Sistemas de Información. Gerencia Asistencial de Atención Primaria, Servicio Madrileño de Salud, Madrid
  - Fundación de Investigación e Innovación Biosanitaria de Atención Primaria, Madrid
  - Enfermería. Universidad Alfonso X El Sabio, Villanueva de la Cañada
- Miguel Ángel Salinero-Fort
  - Fundación de Investigación e Innovación Biosanitaria de Atención Primaria, Madrid
  - Subdirección General de Investigación Sanitaria. Consejería de Sanidad, Madrid
  - Red de Investigación en Servicios de Salud en Enfermedades Crónicas
  - Grupo de Envejecimiento y Fragilidad de las personas mayores. IdIPAZ, Madrid
- José R Banegas
  - Department of Preventive Medicine and Public Health. Universidad Autónoma de Madrid and CIBER of Epidemiology and Public Health, Madrid
- Fernando Rodríguez-Artalejo
  - Department of Preventive Medicine and Public Health. Universidad Autónoma de Madrid and CIBER of Epidemiology and Public Health, Madrid
  - IMDEA-Food Institute. CEI UAM+CSIC, Madrid
- Esther Lopez-Garcia
  - Department of Preventive Medicine and Public Health. Universidad Autónoma de Madrid and CIBER of Epidemiology and Public Health, Madrid
  - IMDEA-Food Institute. CEI UAM+CSIC, Madrid
9
Liu J, Si Y, Niu Y, Zhang R. Projection quantile correlation and its use in high-dimensional grouped variable screening. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2021.107369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
10
Wang K, Liu Y, Lu G, Xiao J, Huang J, Lei L, Peng J, Li Y, Wei S. A functional methylation signature to predict the prognosis of Chinese lung adenocarcinoma based on TCGA. Cancer Med 2021; 11:281-294. [PMID: 34854250 PMCID: PMC8704183 DOI: 10.1002/cam4.4431] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/27/2021] [Revised: 10/08/2021] [Accepted: 10/10/2021] [Indexed: 01/16/2023]
Abstract
Background Lung cancer is the leading cause of cancer morbidity and mortality worldwide; however, individualized treatment is still unsatisfactory. DNA methylation can affect gene regulation and may be one of the most valuable biomarkers in predicting the prognosis of lung adenocarcinoma. This study aimed to identify methylation CpG sites that may be used to predict lung adenocarcinoma prognosis. Methods The Cancer Genome Atlas (TCGA) database was used to detect methylation CpG sites associated with lung adenocarcinoma prognosis and construct a methylation signature model. Then, a Chinese cohort study was carried out to estimate the association between methylation and lung adenocarcinoma prognosis. Biological function studies, including demethylation treatment, cell proliferative capacity, and gene expression changes in lung adenocarcinoma cell lines, were further performed. Results In the TCGA set, three methylation CpG sites associated with lung adenocarcinoma prognosis were selected (cg14517217, cg15386964, and cg18878992). The risk of mortality in lung adenocarcinoma patients increased as the level of the methylation signature, based on the levels of the three methylation sites, gradually increased (HR = 45.30, 95% CI = 26.69–66.83; p < 0.001). The C‐statistic value increased to 0.77 when age, gender, and other clinical variables were added to the signature in the prediction model. A similar situation was confirmed in the Chinese lung adenocarcinoma cohort. In the biological function studies, the proliferative capacity of cell lines was inhibited when the cells were demethylated with 5‐aza‐2'‐deoxycytidine (5‐aza‐2dC). The mRNA and protein expression levels of SEPT9 and HIST1H2BH (cg14517217 and cg15386964) were downregulated with different concentrations of 5‐aza‐2dC treatment, while cg18878992 showed the opposite result. Conclusion This study is the first to develop a three‐CpG‐based model for lung adenocarcinoma, which is a practical and useful tool for prognostic prediction that has been validated in a Chinese population.
Affiliation(s)
- Ke Wang
  - Medical College, Hubei University of Arts and Science, Xiangyang, Hubei, China
  - Department of Epidemiology and Biostatistics, Ministry of Education Key Laboratory of Environment and Health, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Ying Liu
  - Department of Epidemiology and Biostatistics, Ministry of Education Key Laboratory of Environment and Health, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Guanzhong Lu
  - Medical College, Hubei University of Arts and Science, Xiangyang, Hubei, China
- Jinrong Xiao
  - Department of Epidemiology and Biostatistics, Ministry of Education Key Laboratory of Environment and Health, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Jiao Huang
  - Department of Epidemiology and Biostatistics, Ministry of Education Key Laboratory of Environment and Health, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Lin Lei
  - Department of Cancer Control, Shenzhen Center for Chronic Disease Control, Shenzhen, Guangdong, China
- Ji Peng
  - Department of Cancer Control, Shenzhen Center for Chronic Disease Control, Shenzhen, Guangdong, China
- Yangkai Li
  - Department of Thoracic Surgery, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
- Sheng Wei
  - Department of Epidemiology and Biostatistics, Ministry of Education Key Laboratory of Environment and Health, School of Public Health, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, Hubei, China
11
Kim J, Shen J, Wang A, Mehrotra DV, Ko S, Zhou JJ, Zhou H. VCSEL: Prioritizing SNP-set by penalized variance component selection. Ann Appl Stat 2021; 15:1652-1672. [DOI: 10.1214/21-aoas1491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022]
Affiliation(s)
- Juhyun Kim
  - Department of Biostatistics, University of California, Los Angeles
- Judong Shen
  - Biostatistics and Research Decision Sciences, Merck & Co., Inc
- Anran Wang
  - Biostatistics and Research Decision Sciences, Merck & Co., Inc
- Seyoon Ko
  - Department of Biostatistics, University of California, Los Angeles
- Jin J. Zhou
  - Department of Medicine, University of California, Los Angeles
- Hua Zhou
  - Department of Biostatistics, University of California, Los Angeles
12
Mieth B, Rozier A, Rodriguez JA, Höhne MMC, Görnitz N, Müller KR. DeepCOMBI: explainable artificial intelligence for the analysis and discovery in genome-wide association studies. NAR Genom Bioinform 2021; 3:lqab065. [PMID: 34296082 PMCID: PMC8291080 DOI: 10.1093/nargab/lqab065] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 11/24/2020] [Revised: 05/27/2021] [Accepted: 07/08/2021] [Indexed: 02/06/2023]
Abstract
Deep learning has revolutionized data science in many fields by greatly improving prediction performances in comparison to conventional approaches. Recently, explainable artificial intelligence has emerged as an area of research that goes beyond pure prediction improvement by extracting knowledge from deep learning methodologies through the interpretation of their results. We investigate such explanations to explore the genetic architectures of phenotypes in genome-wide association studies. Instead of testing each position in the genome individually, the novel three-step algorithm, called DeepCOMBI, first trains a neural network for the classification of subjects into their respective phenotypes. Second, it explains the classifiers’ decisions by applying layer-wise relevance propagation as one example from the pool of explanation techniques. The resulting importance scores are eventually used to determine a subset of the most relevant locations for multiple hypothesis testing in the third step. The performance of DeepCOMBI in terms of power and precision is investigated on generated datasets and a 2007 study. Verification of the latter is achieved by validating all findings with independent studies published up until 2020. DeepCOMBI is shown to outperform ordinary raw P-value thresholding and other baseline methods. Two novel disease associations (rs10889923 for hypertension, rs4769283 for type 1 diabetes) were identified.
Affiliation(s)
- Bettina Mieth
- Machine Learning Group, Technische Universität Berlin, Berlin 10587, Germany
| | - Alexandre Rozier
- Machine Learning Group, Technische Universität Berlin, Berlin 10587, Germany
| | - Juan Antonio Rodriguez
- CNAG-CRG, Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Barcelona 08003, Spain
| | - Marina M C Höhne
- Machine Learning Group, Technische Universität Berlin, Berlin 10587, Germany
| | | | - Klaus-Robert Müller
- Machine Learning Group, Technische Universität Berlin, Berlin 10587, Germany

13
Zou K, Kim KS, Kim K, Kang D, Park YH, Sun H, Ha BK, Ha J, Jun TH. Genetic Diversity and Genome-Wide Association Study of Seed Aspect Ratio Using a High-Density SNP Array in Peanut (Arachis hypogaea L.). Genes (Basel) 2020; 12:E2. [PMID: 33375051 PMCID: PMC7822046 DOI: 10.3390/genes12010002] [Received: 08/22/2020] [Revised: 12/09/2020] [Accepted: 12/17/2020] [Indexed: 12/12/2022] Open
Abstract
Peanut (Arachis hypogaea L.) is one of the important oil crops of the world. In this study, we aimed to evaluate the genetic diversity of 384 peanut germplasms including 100 Korean germplasms and 284 core collections from the United States Department of Agriculture (USDA) using an Axiom_Arachis array with 58K single-nucleotide polymorphisms (SNPs). We evaluated the evolutionary relationships among 384 peanut germplasms using a genome-wide association study (GWAS) of seed aspect ratio data processed by ImageJ software. In total, 14,030 filtered polymorphic SNPs were identified from the peanut 58K SNP array. We identified five SNPs with significant associations to seed aspect ratio on chromosomes Aradu.A09, Aradu.A10, Araip.B08, and Araip.B09. AX-177640219 on chromosome Araip.B08 was the most significantly associated marker in GAPIT and Regularization method. Phosphoenolpyruvate carboxylase (PEPC) was found among the eleven genes within a linkage disequilibrium (LD) of the significant SNPs on Araip.B08 and could have a strong causal effect in determining seed aspect ratio. The results of the present study provide information and methods that are useful for further genetic and genomic studies as well as molecular breeding programs in peanuts.
Affiliation(s)
- Kunyan Zou
- Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea
| | | | - Kipoong Kim
- Department of Statistics, Pusan National University, Busan 46241, Korea
| | - Dongwoo Kang
- Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea
| | - Yu-Hyeon Park
- Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea
| | - Hokeun Sun
- Department of Statistics, Pusan National University, Busan 46241, Korea
| | - Bo-Keun Ha
- Department of Applied Plant Science, Chonnam National University, Gwangju 61186, Korea
| | - Jungmin Ha
- Department of Plant Science, Gangneung-Wonju National University, Gangneung 25457, Korea
| | - Tae-Hwan Jun
- Department of Plant Bioscience, Pusan National University, Miryang 50463, Korea
- Life and Industry Convergence Research Institute, Pusan National University, Miryang 50463, Korea

14
Selection probability of multivariate regularization to identify pleiotropic variants in genetic association studies. Communications for Statistical Applications and Methods 2020. [DOI: 10.29220/csam.2020.27.5.535] [Indexed: 11/17/2022]

15
Chu BB, Keys KL, German CA, Zhou H, Zhou JJ, Sobel EM, Sinsheimer JS, Lange K. Iterative hard thresholding in genome-wide association studies: Generalized linear models, prior weights, and double sparsity. Gigascience 2020; 9:giaa044. [PMID: 32491161 PMCID: PMC7268817 DOI: 10.1093/gigascience/giaa044] [Received: 11/19/2019] [Revised: 02/27/2020] [Accepted: 04/14/2020] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND Consecutive testing of single nucleotide polymorphisms (SNPs) is usually employed to identify genetic variants associated with complex traits. Ideally one should model all covariates in unison, but most existing analysis methods for genome-wide association studies (GWAS) perform only univariate regression. RESULTS We extend and efficiently implement iterative hard thresholding (IHT) for multiple regression, treating all SNPs simultaneously. Our extensions accommodate generalized linear models, prior information on genetic variants, and grouping of variants. In our simulations, IHT recovers up to 30% more true predictors than SNP-by-SNP association testing and exhibits a 2-3 orders of magnitude decrease in false-positive rates compared with lasso regression. We also test IHT on the UK Biobank hypertension phenotypes and the Northern Finland Birth Cohort of 1966 cardiovascular phenotypes. We find that IHT scales to the large datasets of contemporary human genetics and recovers the plausible genetic variants identified by previous studies. CONCLUSIONS Our real data analysis and simulation studies suggest that IHT can (i) recover highly correlated predictors, (ii) avoid over-fitting, (iii) deliver better true-positive and false-positive rates than either marginal testing or lasso regression, (iv) recover unbiased regression coefficients, (v) exploit prior information and group-sparsity, and (vi) be used with biobank-sized datasets. Although these advances are studied for genome-wide association studies inference, our extensions are pertinent to other regression problems with large numbers of predictors.
Affiliation(s)
- Benjamin B Chu
- Department of Computational Medicine, University of California, Los Angeles, 621 Charles E Young Dr S, Los Angeles, CA, 90095, USA
| | - Kevin L Keys
- Department of Medicine, University of California, San Francisco, 1701 Divisadero St, San Francisco, CA, 94115, USA
- Berkeley Institute of Data Science, University of California, Berkeley, 190 Doe Library, Berkeley, CA 94720, USA
| | - Christopher A German
- Department of Biostatistics, University of California, Los Angeles, 650 Charles E Young Dr S, Los Angeles, CA, 90095, USA
| | - Hua Zhou
- Department of Biostatistics, University of California, Los Angeles, 650 Charles E Young Dr S, Los Angeles, CA, 90095, USA
| | - Jin J Zhou
- Division of Epidemiology and Biostatistics, University of Arizona, 1295 N. Martin Ave. Tucson, AZ, 85724, USA
| | - Eric M Sobel
- Department of Computational Medicine, University of California, Los Angeles, 621 Charles E Young Dr S, Los Angeles, CA, 90095, USA
- Department of Human Genetics, University of California, Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA, 90095 USA
| | - Janet S Sinsheimer
- Department of Computational Medicine, University of California, Los Angeles, 621 Charles E Young Dr S, Los Angeles, CA, 90095, USA
- Department of Biostatistics, University of California, Los Angeles, 650 Charles E Young Dr S, Los Angeles, CA, 90095, USA
- Department of Human Genetics, University of California, Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA, 90095 USA
| | - Kenneth Lange
- Department of Computational Medicine, University of California, Los Angeles, 621 Charles E Young Dr S, Los Angeles, CA, 90095, USA
- Department of Human Genetics, University of California, Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA, 90095 USA

16
Affiliation(s)
- Rok Blagus
- Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
| | - Jelle J. Goeman
- Biomedical Data Sciences, Leiden University Medical Center, Leiden, The Netherlands

17
Kim K, Sun H. Incorporating genetic networks into case-control association studies with high-dimensional DNA methylation data. BMC Bioinformatics 2019; 20:510. [PMID: 31640538 PMCID: PMC6805595 DOI: 10.1186/s12859-019-3040-x] [Received: 04/10/2019] [Accepted: 08/21/2019] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND In human genetic association studies with high-dimensional gene expression data, it is well known that statistical selection methods utilizing prior biological network knowledge, such as genetic pathways and signaling pathways, can outperform methods that ignore genetic network structures in terms of true positive selection. In recent epigenetic research on case-control association studies, many statistical methods have been proposed to identify cancer-related CpG sites and their corresponding genes from high-dimensional DNA methylation array data. However, most existing methods are not designed to utilize genetic network information, although methylation levels of linked genes in the genetic networks tend to be highly correlated with each other. RESULTS We propose a new approach that combines data dimension reduction techniques with network-based regularization to identify outcome-related genes for analysis of high-dimensional DNA methylation data. In simulation studies, we demonstrated that the proposed approach outperforms other statistical methods that do not utilize genetic network information in terms of true positive selection. We also applied it to the 450K DNA methylation array data of the four breast invasive carcinoma cancer subtypes from The Cancer Genome Atlas (TCGA) project. CONCLUSIONS The proposed variable selection approach can utilize prior biological network information for analysis of high-dimensional DNA methylation array data. It first captures gene-level signals from multiple CpG sites using a data dimension reduction technique and then performs network-based regularization based on biological network graph information. It can select potentially cancer-related genes and genetic pathways that were missed by the existing methods.
Affiliation(s)
- Kipoong Kim
- Department of Statistics, Pusan National University, Busan, 46241, Korea
| | - Hokeun Sun
- Department of Statistics, Pusan National University, Busan, 46241, Korea

18
Luo S, Chen Z. Feature Selection by Canonical Correlation Search in High-Dimensional Multiresponse Models With Complex Group Structures. J Am Stat Assoc 2019. [DOI: 10.1080/01621459.2019.1609972] [Indexed: 10/26/2022]
Affiliation(s)
- Shan Luo
- Department of Statistics, Shanghai Jiao Tong University, Shanghai, China
| | - Zehua Chen
- Department of Statistics & Applied Probability, National University of Singapore, Singapore

19
Zhou H, Sinsheimer JS, Bates DM, Chu BB, German CA, Ji SS, Keys KL, Kim J, Ko S, Mosher GD, Papp JC, Sobel EM, Zhai J, Zhou JJ, Lange K. OPENMENDEL: a cooperative programming project for statistical genetics. Hum Genet 2019; 139:61-71. [PMID: 30915546 DOI: 10.1007/s00439-019-02001-z] [Received: 10/26/2018] [Accepted: 03/15/2019] [Indexed: 01/06/2023]
Abstract
Statistical methods for genome-wide association studies (GWAS) continue to improve. However, the increasing volume and variety of genetic and genomic data make computational speed and ease of data manipulation mandatory in future software. In our view, a collaborative effort of statistical geneticists is required to develop open source software targeted to genetic epidemiology. Our attempt to meet this need is called the OPENMENDEL project (https://openmendel.github.io). It aims to (1) enable interactive and reproducible analyses with informative intermediate results, (2) scale to big data analytics, (3) embrace parallel and distributed computing, (4) adapt to rapid hardware evolution, (5) allow cloud computing, (6) allow integration of varied genetic data types, and (7) foster easy communication between clinicians, geneticists, statisticians, and computer scientists. This article reviews and makes recommendations to the genetic epidemiology community in the context of the OPENMENDEL project.
Affiliation(s)
- Hua Zhou
- Department of Biostatistics, UCLA Fielding School of Public Health, Los Angeles, USA.
| | - Janet S Sinsheimer
- Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, USA.
| | - Douglas M Bates
- Department of Statistics, University of Wisconsin, Madison, USA
| | - Benjamin B Chu
- Department of Biomathematics, David Geffen School of Medicine at UCLA, Los Angeles, USA
| | - Christopher A German
- Department of Biostatistics, UCLA Fielding School of Public Health, Los Angeles, USA
| | - Sarah S Ji
- Department of Biostatistics, UCLA Fielding School of Public Health, Los Angeles, USA
| | - Kevin L Keys
- Department of Medicine, University of California, San Francisco, USA
| | - Juhyun Kim
- Department of Biostatistics, UCLA Fielding School of Public Health, Los Angeles, USA
| | - Seyoon Ko
- Department of Statistics, Seoul National University, Seoul, South Korea
| | - Gordon D Mosher
- Departments of Statistics and Computer Science, University of California, Riverside, USA
| | - Jeanette C Papp
- Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, USA
| | - Eric M Sobel
- Department of Human Genetics, David Geffen School of Medicine at UCLA, Los Angeles, USA
| | - Jing Zhai
- Department of Epidemiology and Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona, Tucson, USA
| | - Jin J Zhou
- Department of Epidemiology and Biostatistics, Mel and Enid Zuckerman College of Public Health, University of Arizona, Tucson, USA
| | - Kenneth Lange
- Department of Biomathematics, David Geffen School of Medicine at UCLA, Los Angeles, USA.

20
Qian W, Li W, Sogawa Y, Fujimaki R, Yang X, Liu J. An Interactive Greedy Approach to Group Sparsity in High Dimensions. Technometrics 2019. [DOI: 10.1080/00401706.2018.1537897] [Indexed: 10/27/2022]
Affiliation(s)
- Wei Qian
- Department of Applied Economics and Statistics, University of Delaware
| | - Wending Li
- Department of Computer Science, University of Rochester
| | | | | | - Xitong Yang
- Department of Computer Science, University of Rochester
| | - Ji Liu
- Department of Computer Science, University of Rochester

21
Katsevich E, Sabatti C. Multilayer knockoff filter: Controlled variable selection at multiple resolutions. Ann Appl Stat 2019; 13:1-33. [PMID: 31687060 PMCID: PMC6827557 DOI: 10.1214/18-aoas1185] [Indexed: 11/19/2022]
Abstract
We tackle the problem of selecting from among a large number of variables those that are "important" for an outcome. We consider situations where groups of variables are also of interest. For example, each variable might be a genetic polymorphism, and we might want to study how a trait depends on variability in genes, segments of DNA that typically contain multiple such polymorphisms. In this context, to discover that a variable is relevant for the outcome implies discovering that the larger entity it represents is also important. To guarantee meaningful results with high chance of replicability, we suggest controlling the rate of false discoveries for findings at the level of individual variables and at the level of groups. Building on the knockoff construction of Barber and Candès [Ann. Statist. 43 (2015) 2055-2085] and the multilayer testing framework of Barber and Ramdas [J. Roy. Statist. Soc. Ser. B 79 (2017) 1247-1268], we introduce the multilayer knockoff filter (MKF). We prove that MKF simultaneously controls the FDR at each resolution and use simulations to show that it incurs little power loss compared to methods that provide guarantees only for the discoveries of individual variables. We apply MKF to analyze a genetic dataset and find that it successfully reduces the number of false gene discoveries without a significant reduction in power.
Affiliation(s)
- Eugene Katsevich
- Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305
| | - Chiara Sabatti
- Department of Statistics, Stanford University, 390 Serra Mall, Stanford, California 94305

22
Fan X, Wang H, Sun L, Zheng X, Yin X, Zuo X, Peng Q, Standish KA, Cheng H, Zhang Y, Wang Z, Xiao F, Yang S, Zhang X, Schork NJ. Fine mapping and subphenotyping implicates ADRA1B gene variants in psoriasis susceptibility in a Chinese population. Epigenomics 2019; 11:455-467. [PMID: 30785334 DOI: 10.2217/epi-2018-0131] [Indexed: 02/08/2023] Open
Abstract
AIM A genomic region on 5q33.3 lies between and encompasses the IL12B and PTTG1 genes, and contains many potential psoriasis causal variants. We aimed to further examine the influence of variants in and around this region. MATERIALS & METHODS We used least absolute shrinkage and selection operator (LASSO)-based regression analysis to assess independent contributions of 2171 variants to psoriasis susceptibility and tested them for association with different clinical psoriasis subtypes. RESULTS We found that ADRA1B gene variants contribute to psoriasis in a Chinese population. ADRA1B gene variants have a stronger association with the moderate-to-severe disease group and with an earlier age at onset of psoriasis than IL12B and PTTG1 variants. CONCLUSION The association of variants in the ADRA1B gene with psoriasis could explain why variants in the IL12B, ADRA1B and PTTG1 gene regions are associated with psoriasis.
Affiliation(s)
- Xing Fan
- Department of Dermatology, Anhui Medical University, The First Affiliated Hospital of Anhui Medical University, 218 Jixi Road, Shushan District, Hefei City, Anhui, 230022, PR China
| | - Hongyan Wang
- Department of Dermatology, Anhui Medical University, The First Affiliated Hospital of Anhui Medical University, 218 Jixi Road, Shushan District, Hefei City, Anhui, 230022, PR China
| | - Liangdan Sun
- Department of Dermatology, Anhui Medical University, The First Affiliated Hospital of Anhui Medical University, 218 Jixi Road, Shushan District, Hefei City, Anhui, 230022, PR China
| | - Xiaodong Zheng
- Institute of Dermatology, Anhui Medical University, 81 Meishan Road, Shushan District, Hefei City, Anhui, 230032, PR China
| | - Xianyong Yin
- Institute of Dermatology, Anhui Medical University, 81 Meishan Road, Shushan District, Hefei City, Anhui, 230032, PR China
| | - Xianbo Zuo
- Institute of Dermatology, Anhui Medical University, 81 Meishan Road, Shushan District, Hefei City, Anhui, 230032, PR China
| | - Qian Peng
- Molecular & Cellular Neuroscience, The Scripps Research Institute, 10550 North Torrey Pines Road, La Jolla, CA 92037, USA
| | - Kristopher A Standish
- Genomics, Bioinformatics, J. Craig Venter Institute, 4120 Capricorn Lane, La Jolla, CA 92037, USA
| | - Hui Cheng
- Department of Dermatology, Anhui Medical University, The First Affiliated Hospital of Anhui Medical University, 218 Jixi Road, Shushan District, Hefei City, Anhui, 230022, PR China
| | - Yaohua Zhang
- Institute of Dermatology, Department of Dermatology, Huashan Hospital, Fudan University, No.12, Middle Urumqi Road, Shanghai, 200040, PR China
| | - Zaixing Wang
- Department of Dermatology, Anhui Medical University, The First Affiliated Hospital of Anhui Medical University, 218 Jixi Road, Shushan District, Hefei City, Anhui, 230022, PR China
| | - Fengli Xiao
- Department of Dermatology, Anhui Medical University, The First Affiliated Hospital of Anhui Medical University, 218 Jixi Road, Shushan District, Hefei City, Anhui, 230022, PR China
| | - Sen Yang
- Department of Dermatology, Anhui Medical University, The First Affiliated Hospital of Anhui Medical University, 218 Jixi Road, Shushan District, Hefei City, Anhui, 230022, PR China
| | - Xuejun Zhang
- Department of Dermatology, Anhui Medical University, The First Affiliated Hospital of Anhui Medical University, 218 Jixi Road, Shushan District, Hefei City, Anhui, 230022, PR China
| | - Nicholas J Schork
- Human Biology, J. Craig Venter Institute, 4120 Capricorn Lane, La Jolla, CA 92037, USA

23
Genomic prediction of relapse in recipients of allogeneic haematopoietic stem cell transplantation. Leukemia 2018; 33:240-248. [PMID: 30089915 PMCID: PMC6326954 DOI: 10.1038/s41375-018-0229-3] [Received: 12/15/2017] [Revised: 06/21/2018] [Accepted: 07/17/2018] [Indexed: 02/06/2023]
Abstract
Allogeneic haematopoietic stem cell transplantation currently represents the primary potentially curative treatment for cancers of the blood and bone marrow. While relapse occurs in approximately 30% of patients, few risk-modifying genetic variants have been identified. The present study evaluates the predictive potential of patient genetics on relapse risk in a genome-wide manner. We studied 151 graft recipients with HLA-matched sibling donors by sequencing the whole-exome, active immunoregulatory regions, and the full MHC region. To assess the predictive capability and contributions of SNPs and INDELs, we employed machine learning and a feature selection approach in a cross-validation framework to discover the most informative variants while controlling against overfitting. Our results show that germline genetic polymorphisms in patients entail a significant contribution to relapse risk, as judged by the predictive performance of the model (AUC = 0.72 [95% CI: 0.63-0.81]). Furthermore, the top contributing variants were predictive in two independent replication cohorts (n = 258 and n = 125) from the same population. The results can help elucidate relapse mechanisms and suggest novel therapeutic targets. A computational genomic model could provide a step toward individualized prognostic risk assessment, particularly when accompanied by other data modalities.

24
Choi J, Kim K, Sun H. New variable selection strategy for analysis of high-dimensional DNA methylation data. J Bioinform Comput Biol 2018; 16:1850010. [PMID: 29954287 DOI: 10.1142/s0219720018500105] [Indexed: 01/22/2023]
Abstract
In genetic association studies, regularization methods are often used due to their computational efficiency for analysis of high-dimensional genomic data. DNA methylation data generated from the Infinium HumanMethylation450 BeadChip Kit have a group structure where an individual gene consists of multiple Cytosine-phosphate-Guanine (CpG) sites. Consequently, group-based regularization can precisely detect outcome-related CpG sites. Representative examples are sparse group lasso (SGL) and network-based regularization. The former is powerful when most of the CpG sites within the same gene are associated with a phenotype outcome. In contrast, the latter is preferred when only a few of the CpG sites within the same gene are related to the outcome. In this paper, we propose a new variable selection strategy based on a selection probability that measures the selection frequency of individual variables selected by both SGL and network-based regularization. In an extensive simulation study, we demonstrated that the proposed strategy shows relatively strong selection performance across the simulated settings, compared with both SGL and network-based regularization. We also applied the proposed strategy to identify differentially methylated CpG sites and their corresponding genes from ovarian cancer data.
Affiliation(s)
- Jiyun Choi
- Department of Statistics, Pusan National University, Busan 46241, Korea
| | - Kipoong Kim
- Department of Statistics, Pusan National University, Busan 46241, Korea
| | - Hokeun Sun
- Department of Statistics, Pusan National University, Busan 46241, Korea

25
Mat AM, Klopp C, Payton L, Jeziorski C, Chalopin M, Amzil Z, Tran D, Wikfors GH, Hégaret H, Soudant P, Huvet A, Fabioux C. Oyster transcriptome response to Alexandrium exposure is related to saxitoxin load and characterized by disrupted digestion, energy balance, and calcium and sodium signaling. Aquatic Toxicology (Amsterdam, Netherlands) 2018; 199:127-137. [PMID: 29621672 DOI: 10.1016/j.aquatox.2018.03.030] [Received: 02/06/2018] [Revised: 03/22/2018] [Accepted: 03/25/2018] [Indexed: 06/08/2023]
Abstract
Harmful Algal Blooms are worldwide occurrences that can cause poisoning in human seafood consumers as well as mortality and sublethal effects in wildlife, propagating economic losses. One of the most widespread toxigenic microalgal taxa is the dinoflagellate genus Alexandrium, which includes species producing neurotoxins referred to as PST (Paralytic Shellfish Toxins). Blooms cause shellfish harvest restrictions to protect human consumers from accumulated toxins. Large inter-individual variability in toxin load within an exposed bivalve population complicates monitoring of shellfish toxicity for ecology and human health regulation. To decipher the physiological pathways involved in the bivalve response to PST, we explored the whole transcriptome of the digestive gland of the Pacific oyster Crassostrea gigas fed experimentally with a toxic Alexandrium minutum culture. The largest differences in transcript abundance were between oysters with contrasting toxin loads (1098 transcripts), rather than between exposed and non-exposed oysters (16 transcripts), emphasizing the importance of toxin load in oyster response to toxic dinoflagellates. Additionally, penalized regressions, innovative in this field, accurately modeled toxin load based upon only 70 transcripts. Transcriptomic differences between oysters with contrasting PST burdens revealed a limited suite of metabolic pathways affected, including ion channels, neuromuscular communication, and digestion, all of which are interconnected and linked to sodium and calcium exchanges. Carbohydrate metabolism, unconsidered previously in studies of harmful algal effects on shellfish, was also highlighted, suggesting energy challenge in oysters with high toxin loads. Associations between toxin load, genotype, and mRNA levels were revealed that open new doors for genetic studies identifying genetically-based low toxin accumulation.
Affiliation(s)
- Audrey M Mat
- Ifremer, LEMAR UMR 6539 CNRS/UBO/IRD/Ifremer, CS 10070, 29280 Plouzané, France
| | | | - Laura Payton
- UMR 5805 EPOC, CNRS - Université de Bordeaux, F-33120 Arcachon, France
| | | | - Morgane Chalopin
- Ifremer, LEMAR UMR 6539 CNRS/UBO/IRD/Ifremer, CS 10070, 29280 Plouzané, France
| | - Zouher Amzil
- Ifremer, Laboratoire Phycotoxines, rue de l'Ile d'Yeu, BP 21105, F-44311 Nantes, France
| | - Damien Tran
- UMR 5805 EPOC, CNRS - Université de Bordeaux, F-33120 Arcachon, France
| | - Gary H Wikfors
- Northeast Fisheries Science Center, NOAA National Marine Fisheries Service, 212 Rogers Avenue, Milford, CT 06460, USA
| | - Hélène Hégaret
- LEMAR UMR 6539 CNRS/UBO/IRD/Ifremer, IUEM, rue Dumont d'Urville, 29280 Plouzané, France
| | - Philippe Soudant
- LEMAR UMR 6539 CNRS/UBO/IRD/Ifremer, IUEM, rue Dumont d'Urville, 29280 Plouzané, France
| | - Arnaud Huvet
- Ifremer, LEMAR UMR 6539 CNRS/UBO/IRD/Ifremer, CS 10070, 29280 Plouzané, France
| | - Caroline Fabioux
- LEMAR UMR 6539 CNRS/UBO/IRD/Ifremer, IUEM, rue Dumont d'Urville, 29280 Plouzané, France.

26
Keys KL, Chen GK, Lange K. Iterative hard thresholding for model selection in genome-wide association studies. Genet Epidemiol 2017; 41:756-768. [PMID: 28875524 DOI: 10.1002/gepi.22068] [Received: 02/14/2017] [Revised: 07/13/2017] [Accepted: 08/02/2017] [Indexed: 11/05/2022]
Abstract
A genome-wide association study (GWAS) correlates marker and trait variation in a study sample. Each subject is genotyped at a multitude of SNPs (single nucleotide polymorphisms) spanning the genome. Here, we assume that subjects are randomly collected unrelateds and that trait values are normally distributed or can be transformed to normality. Over the past decade, geneticists have been remarkably successful in applying GWAS analysis to hundreds of traits. The massive amount of data produced in these studies present unique computational challenges. Penalized regression with the ℓ1 penalty (LASSO) or minimax concave penalty (MCP) penalties is capable of selecting a handful of associated SNPs from millions of potential SNPs. Unfortunately, model selection can be corrupted by false positives and false negatives, obscuring the genetic underpinning of a trait. Here, we compare LASSO and MCP penalized regression to iterative hard thresholding (IHT). On GWAS regression data, IHT is better at model selection and comparable in speed to both methods of penalized regression. This conclusion holds for both simulated and real GWAS data. IHT fosters parallelization and scales well in problems with large numbers of causal markers. Our parallel implementation of IHT accommodates SNP genotype compression and exploits multiple CPU cores and graphics processing units (GPUs). This allows statistical geneticists to leverage commodity desktop computers in GWAS analysis and to avoid supercomputing. AVAILABILITY Source code is freely available at https://github.com/klkeys/IHT.jl.
Affiliation(s)
- Kevin L Keys
- Department of Medicine, University of California, San Francisco, San Francisco, California, United States of America
- Gary K Chen
- Division of Biostatistics, University of Southern California, Los Angeles, California, United States of America
- Kenneth Lange
- Departments of Biomathematics, Human Genetics, and Statistics, University of California, Los Angeles, California, United States of America
|
27
|
Abstract
Despite thousands of genetic loci identified to date, a large proportion of genetic variation predisposing to complex disease and traits remains unaccounted for. Advances in sequencing technology enable focused explorations on the contribution of low-frequency and rare variants to human traits. Here we review experimental approaches and current knowledge on the contribution of these genetic variants in complex disease and discuss challenges and opportunities for personalised medicine.
Affiliation(s)
- Lorenzo Bomba
- Human Genetics, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, CB10 1HH, UK
- Klaudia Walter
- Human Genetics, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, CB10 1HH, UK
- Nicole Soranzo
- Human Genetics, Wellcome Trust Sanger Institute, Genome Campus, Hinxton, CB10 1HH, UK; Department of Haematology, University of Cambridge, Hills Rd, Cambridge, CB2 0AH, UK; The National Institute for Health Research Blood and Transplant Unit (NIHR BTRU) in Donor Health and Genomics at the University of Cambridge, University of Cambridge, Strangeways Research Laboratory, Wort's Causeway, Cambridge, CB1 8RN, UK.
|
28
|
Longitudinal data analysis for rare variants detection with penalized quadratic inference function. Sci Rep 2017; 7:650. [PMID: 28381821 PMCID: PMC5429681 DOI: 10.1038/s41598-017-00712-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 12/09/2016] [Accepted: 03/08/2017] [Indexed: 11/08/2022] Open
Abstract
Longitudinal genetic data provide more information regarding genetic effects over time compared with cross-sectional data. Coupled with next-generation sequencing technologies, it has become feasible to identify important genes containing both rare and common variants in a longitudinal design. In this work, we adopted a weighted sum statistic (WSS) to collapse multiple variants in a gene region to form a gene score. When multiple genes in a pathway were considered together, a penalized longitudinal model under the quadratic inference function (QIF) framework was applied for efficient gene selection. We evaluated the estimation accuracy and model selection performance under different model settings, then applied the method to a real dataset from the Genetic Analysis Workshop 18 (GAW18). Compared with the unpenalized QIF method, the penalized QIF (pQIF) method achieved better estimation accuracy and higher selection efficiency. The pQIF remained optimal even when the working correlation structure was mis-specified. The real data analysis identified one important gene, angiotensin II receptor type 1 (AGTR1), in the Ca2+/AT-IIR/α-AR signaling pathway. The estimated effect implied that AGTR1 may have a protective effect against hypertension. Our pQIF method provides a general tool for longitudinal sequencing studies involving large numbers of genetic variants.
|
29
|
Zhou H, Blangero J, Dyer TD, Chan KHK, Lange K, Sobel EM. Fast Genome-Wide QTL Association Mapping on Pedigree and Population Data. Genet Epidemiol 2017; 41:174-186. [PMID: 27943406 PMCID: PMC5340631 DOI: 10.1002/gepi.21988] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 08/12/2015] [Revised: 05/02/2016] [Accepted: 05/08/2016] [Indexed: 01/14/2023]
Abstract
Since most analysis software for genome-wide association studies (GWAS) currently exploit only unrelated individuals, there is a need for efficient applications that can handle general pedigree data or mixtures of both population and pedigree data. Even datasets thought to consist of only unrelated individuals may include cryptic relationships that can lead to false positives if not discovered and controlled for. In addition, family designs possess compelling advantages. They are better equipped to detect rare variants, control for population stratification, and facilitate the study of parent-of-origin effects. Pedigrees selected for extreme trait values often segregate a single gene with strong effect. Finally, many pedigrees are available as an important legacy from the era of linkage analysis. Unfortunately, pedigree likelihoods are notoriously hard to compute. In this paper, we reexamine the computational bottlenecks and implement ultra-fast pedigree-based GWAS analysis. Kinship coefficients can either be based on explicitly provided pedigrees or automatically estimated from dense markers. Our strategy (a) works for random sample data, pedigree data, or a mix of both; (b) entails no loss of power; (c) allows for any number of covariate adjustments, including correction for population stratification; (d) allows for testing SNPs under additive, dominant, and recessive models; and (e) accommodates both univariate and multivariate quantitative traits. On a typical personal computer (six CPU cores at 2.67 GHz), analyzing a univariate HDL (high-density lipoprotein) trait from the San Antonio Family Heart Study (935,392 SNPs on 1,388 individuals in 124 pedigrees) takes less than 2 min and 1.5 GB of memory. Complete multivariate QTL analysis of the three time-points of the longitudinal HDL multivariate trait takes less than 5 min and 1.5 GB of memory. 
The algorithm is implemented as the Ped-GWAS Analysis (Option 29) in the Mendel statistical genetics package, which is freely available for Macintosh, Linux, and Windows platforms from http://genetics.ucla.edu/software/mendel.
Affiliation(s)
- Hua Zhou
- Department of Biostatistics, University of California, Los Angeles, California, United States of America
- John Blangero
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Texas, United States of America
- Thomas D Dyer
- South Texas Diabetes and Obesity Institute, University of Texas Rio Grande Valley, Texas, United States of America
- Kei-Hang K Chan
- Department of Human Genetics, University of California, Los Angeles, California, United States of America
- Department of Epidemiology, University of California, Los Angeles, California, United States of America
- Kenneth Lange
- Department of Human Genetics, University of California, Los Angeles, California, United States of America
- Department of Biomathematics, University of California, Los Angeles, California, United States of America
- Department of Statistics, University of California, Los Angeles, California, United States of America
- Eric M Sobel
- Department of Human Genetics, University of California, Los Angeles, California, United States of America
|
30
|
Brzyski D, Peterson CB, Sobczyk P, Candès EJ, Bogdan M, Sabatti C. Controlling the Rate of GWAS False Discoveries. Genetics 2017; 205:61-75. [PMID: 27784720 PMCID: PMC5223524 DOI: 10.1534/genetics.116.193987] [Citation(s) in RCA: 72] [Impact Index Per Article: 10.3] [Received: 07/18/2016] [Accepted: 10/11/2016] [Indexed: 01/13/2023] Open
Abstract
With the rise of both the number and the complexity of traits of interest, control of the false discovery rate (FDR) in genetic association studies has become an increasingly appealing and accepted target for multiple comparison adjustment. While a number of robust FDR-controlling strategies exist, the nature of this error rate is intimately tied to the precise way in which discoveries are counted, and the performance of FDR-controlling procedures is satisfactory only if there is a one-to-one correspondence between what scientists describe as unique discoveries and the number of rejected hypotheses. The presence of linkage disequilibrium between markers in genome-wide association studies (GWAS) often leads researchers to consider the signal associated to multiple neighboring SNPs as indicating the existence of a single genomic locus with possible influence on the phenotype. This a posteriori aggregation of rejected hypotheses results in inflation of the relevant FDR. We propose a novel approach to FDR control that is based on prescreening to identify the level of resolution of distinct hypotheses. We show how FDR-controlling strategies can be adapted to account for this initial selection both with theoretical results and simulations that mimic the dependence structure to be expected in GWAS. We demonstrate that our approach is versatile and useful when the data are analyzed using both tests based on single markers and multiple regression. We provide an R package that allows practitioners to apply our procedure on standard GWAS format data, and illustrate its performance on lipid traits in the North Finland Birth Cohort 66 cohort study.
Affiliation(s)
- Damian Brzyski
- Institute of Mathematics, Jagiellonian University, 30-348 Kraków, Poland
- Department of Epidemiology and Biostatistics, Indiana University, Bloomington, Indiana 47405
- Christine B Peterson
- Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, Texas 77030
- Piotr Sobczyk
- Faculty of Pure and Applied Mathematics, Wrocław University of Science and Technology, 50-370 Wroclaw, Poland
- Malgorzata Bogdan
- Institute of Mathematics, University of Wrocław, 50-384 Wroclaw, Poland
- Chiara Sabatti
- Department of Biomedical Data Science, Stanford University, California
|
31
|
Wang C, Ruggeri F, Hsiao CK, Argiento R. Bayesian nonparametric clustering and association studies for candidate SNP observations. Int J Approx Reason 2017. [DOI: 10.1016/j.ijar.2016.07.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
32
|
Mieth B, Kloft M, Rodríguez JA, Sonnenburg S, Vobruba R, Morcillo-Suárez C, Farré X, Marigorta UM, Fehr E, Dickhaus T, Blanchard G, Schunk D, Navarro A, Müller KR. Combining Multiple Hypothesis Testing with Machine Learning Increases the Statistical Power of Genome-wide Association Studies. Sci Rep 2016; 6:36671. [PMID: 27892471 PMCID: PMC5125008 DOI: 10.1038/srep36671] [Citation(s) in RCA: 35] [Impact Index Per Article: 4.4] [Received: 05/31/2016] [Accepted: 10/06/2016] [Indexed: 12/21/2022] Open
Abstract
The standard approach to the analysis of genome-wide association studies (GWAS) is based on testing each position in the genome individually for statistical significance of its association with the phenotype under investigation. To improve the analysis of GWAS, we propose a combination of machine learning and statistical testing that takes correlation structures within the set of SNPs under investigation in a mathematically well-controlled manner into account. The novel two-step algorithm, COMBI, first trains a support vector machine to determine a subset of candidate SNPs and then performs hypothesis tests for these SNPs together with an adequate threshold correction. Applying COMBI to data from a WTCCC study (2007) and measuring performance as replication by independent GWAS published within the 2008-2015 period, we show that our method outperforms ordinary raw p-value thresholding as well as other state-of-the-art methods. COMBI presents higher power and precision than the examined alternatives while yielding fewer false (i.e. non-replicated) and more true (i.e. replicated) discoveries when its results are validated on later GWAS studies. More than 80% of the discoveries made by COMBI upon WTCCC data have been validated by independent studies. Implementations of the COMBI method are available as a part of the GWASpi toolbox 2.0.
Affiliation(s)
- Bettina Mieth
- Machine Learning Group, Technische Universität Berlin, Berlin, 10587, Germany
- Marius Kloft
- Department of Computer Science, Humboldt University of Berlin, Berlin, 10099, Germany
- Juan Antonio Rodríguez
- Institut de Biología Evolutiva (CSIC-UPF). Departament de Ciències Experimentals i de la Salut. Universitat Pompeu Fabra, Barcelona, 08003, Spain
- Robin Vobruba
- Machine Learning Group, Technische Universität Berlin, Berlin, 10587, Germany
- Carlos Morcillo-Suárez
- Institut de Biología Evolutiva (CSIC-UPF). Departament de Ciències Experimentals i de la Salut. Universitat Pompeu Fabra, Barcelona, 08003, Spain
- Xavier Farré
- Institut de Biología Evolutiva (CSIC-UPF). Departament de Ciències Experimentals i de la Salut. Universitat Pompeu Fabra, Barcelona, 08003, Spain
- Urko M. Marigorta
- School of Biology, Georgia Institute of Technology, Atlanta, 30332, GA, USA
- Ernst Fehr
- Department of Economics, Laboratory for Social and Neural Systems Research, University of Zurich, Zurich, 8006, Switzerland
- Thorsten Dickhaus
- Institute for Statistics (FB 3), University of Bremen, Bremen, 28359, Germany
- Gilles Blanchard
- Department of Mathematics, University of Potsdam, Potsdam, 14476, Germany
- Daniel Schunk
- Department of Economics, University of Mainz, Mainz, 55099, Germany
- Arcadi Navarro
- Institut de Biología Evolutiva (CSIC-UPF). Departament de Ciències Experimentals i de la Salut. Universitat Pompeu Fabra, Barcelona, 08003, Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, 08010, Spain
- Center for Genomic Regulation (CRG), Barcelona Institute of Science and Technology (BIST), Barcelona, 08003, Spain
- Klaus-Robert Müller
- Machine Learning Group, Technische Universität Berlin, Berlin, 10587, Germany
- Department of Brain and Cognitive Engineering, Korea University, Seoul, Republic of Korea
|
33
|
Abstract
BACKGROUND Recent advances in next-generation sequencing technologies have made it possible to generate large amounts of sequence data with rare variants in a cost-effective way. Yet, the statistical aspect of testing disease association of rare variants is quite challenging as the typical assumptions fail to hold owing to low minor allele frequency (<0.5% or 1%). METHODS I present a Bayesian variable selection approach to detect associations with both rare and common genetic variants for quantitative traits simultaneously. In my model, I frame the problem of identifying disease-associated variants as a problem of variable selection in a sparse space, that is, how best to model the relationship between phenotypes and a set of genetic variants. By constructing a risk index score for a group of rare variants, my method can effectively consider all variants in a multivariate model. I also use a within-chain permutation to generate the empirical thresholds to detect true-positive variants. RESULTS I apply my method to study the association between increases in baseline systolic and diastolic blood pressure (SBP and DBP, respectively) and genetic variants in the data from Genetic Analysis Workshop 19 unrelated samples. I identify several rare and common variants in the gene MAP4 that are potentially associated with SBP and DBP. CONCLUSIONS The application shows that my method is powerful in identifying disease-associated variants even when they are extremely rare.
|
34
|
Larson NB, McDonnell S, Albright LC, Teerlink C, Stanford J, Ostrander EA, Isaacs WB, Xu J, Cooney KA, Lange E, Schleutker J, Carpten JD, Powell I, Bailey-Wilson J, Cussenot O, Cancel-Tassin G, Giles G, MacInnis R, Maier C, Whittemore AS, Hsieh CL, Wiklund F, Catolona WJ, Foulkes W, Mandal D, Eeles R, Kote-Jarai Z, Ackerman MJ, Olson TM, Klein CJ, Thibodeau SN, Schaid DJ. Post hoc Analysis for Detecting Individual Rare Variant Risk Associations Using Probit Regression Bayesian Variable Selection Methods in Case-Control Sequencing Studies. Genet Epidemiol 2016; 40:461-9. [PMID: 27312771 PMCID: PMC5063501 DOI: 10.1002/gepi.21983] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 12/16/2015] [Revised: 04/22/2016] [Accepted: 04/27/2016] [Indexed: 12/27/2022]
Abstract
Rare variants (RVs) have been shown to be significant contributors to complex disease risk. By definition, these variants have very low minor allele frequencies and traditional single-marker methods for statistical analysis are underpowered for typical sequencing study sample sizes. Multimarker burden-type approaches attempt to identify aggregation of RVs across case-control status by analyzing relatively small partitions of the genome, such as genes. However, it is generally the case that the aggregative measure would be a mixture of causal and neutral variants, and these omnibus tests do not directly provide any indication of which RVs may be driving a given association. Recently, Bayesian variable selection approaches have been proposed to identify RV associations from a large set of RVs under consideration. Although these approaches have been shown to be powerful at detecting associations at the RV level, there are often computational limitations on the total quantity of RVs under consideration and compromises are necessary for large-scale application. Here, we propose a computationally efficient alternative formulation of this method using a probit regression approach specifically capable of simultaneously analyzing hundreds to thousands of RVs. We evaluate our approach to detect causal variation on simulated data and examine sensitivity and specificity in instances of high RV dimensionality as well as apply it to pathway-level RV analysis results from a prostate cancer (PC) risk case-control sequencing study. Finally, we discuss potential extensions and future directions of this work.
Affiliation(s)
- Nicholas B. Larson
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN
- Shannon McDonnell
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN
- Lisa Cannon Albright
- Dept. Internal Medicine, University of Utah School of Medicine, Salt Lake City, UT
- Craig Teerlink
- Dept. Internal Medicine, University of Utah School of Medicine, Salt Lake City, UT
- Jianfeng Xu
- NorthShore University Health System Research Institute, Chicago, IL
- Kathleen A. Cooney
- Depts. of Internal Medicine and Urology, University of Michigan Medical School, Ann Arbor, MI
- Ethan Lange
- Dept. of Genetics, University of North Carolina, Chapel Hill, NC
- Johanna Schleutker
- Dept. of Medical Biochemistry and Genetics, Institute of Biomedicine, University of Turku, Finland
- John D. Carpten
- Integrated Cancer Genomics Division, The Translational Genomics Research Institute, Phoenix, AZ
- Joan Bailey-Wilson
- Statistical Genetics Section, National Human Genome Research Institute, Bethesda, MD
- Graham Giles
- Cancer Epidemiology Centre, Cancer Council Victoria, and Centre for Epidemiology and Biostatistics, School of Population and Global Health, University of Melbourne, Melbourne, Australia
- Robert MacInnis
- Cancer Epidemiology Centre, Cancer Council Victoria, and Centre for Epidemiology and Biostatistics, School of Population and Global Health, University of Melbourne, Melbourne, Australia
- Chih-Lin Hsieh
- Dept. of Urology, University of Southern California, Los Angeles, CA
- Fredrik Wiklund
- Dept. of Medical Epidemiology and Biostatistics, Karolinska Institutet, Stockholm, Sweden
- William Foulkes
- Depts. of Oncology and Human Genetics, Montreal General Hospital, Montreal QC, Canada
- Diptasri Mandal
- Dept. of Genetics, LSU Health Sciences Center, New Orleans, LA
- Rosalind Eeles
- Genetics and Epidemiology, Institute of Cancer Research, Sutton Surrey, UK
- Zsofia Kote-Jarai
- Genetics and Epidemiology, Institute of Cancer Research, Sutton Surrey, UK
- Timothy M. Olson
- Dept. of Pediatric and Adolescent Medicine, Mayo Clinic, Rochester, MN
- Daniel J. Schaid
- Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, MN
|
35
|
He Q, Cai T, Liu Y, Zhao N, Harmon QE, Almli LM, Binder EB, Engel SM, Ressler KJ, Conneely KN, Lin X, Wu MC. Prioritizing individual genetic variants after kernel machine testing using variable selection. Genet Epidemiol 2016; 40:722-731. [PMID: 27488097 DOI: 10.1002/gepi.21993] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 12/07/2015] [Revised: 05/28/2016] [Accepted: 06/20/2016] [Indexed: 01/06/2023]
Abstract
Kernel machine learning methods, such as the SNP-set kernel association test (SKAT), have been widely used to test associations between traits and genetic polymorphisms. In contrast to traditional single-SNP analysis methods, these methods are designed to examine the joint effect of a set of related SNPs (such as a group of SNPs within a gene or a pathway) and are able to identify sets of SNPs that are associated with the trait of interest. However, as with many multi-SNP testing approaches, kernel machine testing can draw conclusions only at the SNP-set level, and does not directly indicate which SNPs in the identified set are actually driving the associations. A recently proposed procedure, KerNel Iterative Feature Extraction (KNIFE), provides a general framework for incorporating variable selection into kernel machine methods. In this article, we focus on quantitative traits and relatively common SNPs, adapt the KNIFE procedure to genetic association studies, and propose an approach to identify driver SNPs after the application of SKAT to gene set analysis. Our approach accommodates several kernels that are widely used in SNP analysis, such as the linear kernel and the Identity by State (IBS) kernel. The proposed approach provides practically useful utilities to prioritize SNPs, and fills the gap between SNP set analysis and biological functional studies. Both simulation studies and real data application are used to demonstrate the proposed approach.
Affiliation(s)
- Qianchuan He
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Tianxi Cai
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America
- Yang Liu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Ni Zhao
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
- Quaker E Harmon
- Epidemiology Branch, NIEHS, Research Triangle Park, North Carolina, United States of America
- Lynn M Almli
- Department of Psychiatry and Behavioral Sciences, Emory University School of Medicine, Atlanta, Georgia, United States of America
- Elisabeth B Binder
- Department of Translational Research in Psychiatry, Max-Planck Institute of Psychiatry, Munich, Germany
- Stephanie M Engel
- Department of Epidemiology, University of North Carolina, Chapel Hill, North Carolina, United States of America
- Kerry J Ressler
- Division of Depression & Anxiety Disorders, McLean Hospital, Belmont, Massachusetts, United States of America
- Karen N Conneely
- Department of Human Genetics, Emory University School of Medicine, Atlanta, Georgia, United States of America
- Xihong Lin
- Department of Biostatistics, Harvard School of Public Health, Boston, Massachusetts, United States of America
- Michael C Wu
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, United States of America
|
36
|
Abstract
Over the past few years, interest in the identification of rare variants that influence human phenotype has led to the development of many statistical methods for testing for association between sets of rare variants and binary or quantitative traits. Here, I review some of the most important ideas that underlie these methods and the most relevant issues when choosing a method for analysis. In addition to the tests for association, I review crucial issues in performing a rare variant study, from experimental design to interpretation and validation. I also discuss the many challenges of these studies, some of their limitations, and future research directions.
Affiliation(s)
- Dan L Nicolae
- Departments of Medicine and Statistics, University of Chicago, Chicago, Illinois 60637
|
37
|
Mallick H, Tiwari HK. EM Adaptive LASSO-A Multilocus Modeling Strategy for Detecting SNPs Associated with Zero-inflated Count Phenotypes. Front Genet 2016; 7:32. [PMID: 27066062 PMCID: PMC4811966 DOI: 10.3389/fgene.2016.00032] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 06/22/2015] [Accepted: 02/22/2016] [Indexed: 11/13/2022] Open
Abstract
Count data are increasingly ubiquitous in genetic association studies, where it is possible to observe excess zero counts as compared to what is expected based on standard assumptions. For instance, in rheumatology, data are usually collected in multiple joints within a person or multiple sub-regions of a joint, and it is not uncommon that the phenotypes contain an enormous number of zeroes due to the presence of excessive zero counts in the majority of patients. Most existing statistical methods assume that the count phenotypes follow one of these four distributions with appropriate dispersion-handling mechanisms: Poisson, Zero-inflated Poisson (ZIP), Negative Binomial, and Zero-inflated Negative Binomial (ZINB). However, little is known about their implications in genetic association studies. Also, there is a relative paucity of literature on their usefulness with respect to model misspecification and variable selection. In this article, we have investigated the performance of several state-of-the-art approaches for handling zero-inflated count data along with a novel penalized regression approach with an adaptive LASSO penalty, by simulating data under a variety of disease models and linkage disequilibrium patterns. By taking into account data-adaptive weights in the estimation procedure, the proposed method provides greater flexibility in multi-SNP modeling of zero-inflated count phenotypes. A fast coordinate descent algorithm nested within an EM (expectation-maximization) algorithm is implemented for estimating the model parameters and conducting variable selection simultaneously. Results show that the proposed method has optimal performance in the presence of multicollinearity, as measured by both prediction accuracy and empirical power, which is especially apparent as the sample size increases. Moreover, the Type I error rates become more or less uncontrollable for the competing methods when a model is misspecified, a phenomenon routinely encountered in practice.
Affiliation(s)
- Himel Mallick
- Department of Biostatistics, Harvard T. H. Chan School of Public Health, Harvard University, Boston, MA, USA; Program of Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Hemant K Tiwari
- Section on Statistical Genetics, Department of Biostatistics, School of Public Health, University of Alabama at Birmingham, Birmingham, AL, USA
|
38
|
Stell L, Sabatti C. Genetic Variant Selection: Learning Across Traits and Sites. Genetics 2016; 202:439-55. [PMID: 26680660 PMCID: PMC4788227 DOI: 10.1534/genetics.115.184572] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/06/2015] [Accepted: 11/30/2015] [Indexed: 11/18/2022] Open
Abstract
We consider resequencing studies of associated loci and the problem of prioritizing sequence variants for functional follow-up. Working within the multivariate linear regression framework helps us to account for the joint effects of multiple genes; and adopting a Bayesian approach leads to posterior probabilities that coherently incorporate all information about the variants' function. We describe two novel prior distributions that facilitate learning the role of each variable site by borrowing evidence across phenotypes and across mutations in the same gene. We illustrate their potential advantages with simulations and reanalyzing a data set of sequencing variants.
Affiliation(s)
- Laurel Stell
- Department of Health Research and Policy, Stanford University, Stanford, California 94305
- Chiara Sabatti
- Department of Health Research and Policy, Stanford University, Stanford, California 94305; Department of Statistics, Stanford University, Stanford, California 94305
|
39
|
Pineda S, Real FX, Kogevinas M, Carrato A, Chanock SJ, Malats N, Van Steen K. Integration Analysis of Three Omics Data Using Penalized Regression Methods: An Application to Bladder Cancer. PLoS Genet 2015; 11:e1005689. [PMID: 26646822 PMCID: PMC4672920 DOI: 10.1371/journal.pgen.1005689] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.6] [Received: 06/08/2015] [Accepted: 10/30/2015] [Indexed: 01/10/2023] Open
Abstract
Omics data integration is becoming necessary to investigate the genomic mechanisms involved in complex diseases. During the integration process, many challenges arise such as data heterogeneity, the smaller number of individuals in comparison to the number of parameters, multicollinearity, and interpretation and validation of results due to their complexity and lack of knowledge about biological processes. To overcome some of these issues, innovative statistical approaches are being developed. In this work, we propose a permutation-based method to concomitantly assess significance and correct for multiple testing with the MaxT algorithm. This was applied with penalized regression methods (LASSO and ENET) when exploring relationships between common genetic variants, DNA methylation and gene expression measured in bladder tumor samples. The overall analysis flow consisted of three steps: (1) SNPs/CpGs were selected for each gene probe within a 1 Mb window upstream and downstream of the gene; (2) LASSO and ENET were applied to assess the association between each expression probe and the selected SNPs/CpGs in three multivariable models (SNP, CpG, and Global models, the latter integrating SNPs and CpGs); and (3) the significance of each model was assessed using the permutation-based MaxT method. We identified 48 genes whose expression levels were significantly associated with both SNPs and CpGs. Importantly, 36 (75%) of them were replicated in an independent data set (TCGA) and the performance of the proposed method was checked with a simulation study. We further support our results with a biological interpretation based on an enrichment analysis. The approach we propose allows reducing computational time and is flexible and easy to implement when analyzing several types of omics data.
Our results highlight the importance of integrating omics data by applying appropriate statistical strategies to discover new insights into the complex genetic mechanisms involved in disease conditions. At present, it is already possible to generate different types of high-throughput omics data in the same individuals. However, we lack methodology to adequately combine them. Many challenges arise as the amount of data increases, and we need to find ways to identify and understand the complex relationships when integrating data. In this regard, new statistical approaches are needed, such as the ones we propose and apply here to integrate three types of omics data (genomics, epigenomics, and transcriptomics) generated using bladder cancer tumor samples. These innovative approaches (LASSO and ENET combined with a permutation-based MaxT method) allowed us to find 48 genes whose expression levels were significantly associated with genomics and epigenomics markers. The adequacy of this approach was confirmed by the use of an independent data set from The Cancer Genome Atlas Consortium: 75% of the genes were replicated. Sound prior biological evidence further supports the results obtained.
Affiliation(s)
- Silvia Pineda
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Systems and Modeling Unit–BIO3, Montefiore Institute, Liège, Belgium
- Francisco X. Real
- Epithelial Carcinogenesis Group, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, Barcelona, Spain
- Manolis Kogevinas
- Centre for Research in Environmental Epidemiology (CREAL) and Parc de Salut Mar, Barcelona, Spain
- Alfredo Carrato
- Servicio de Oncología, Hospital Universitario Ramon y Cajal, Madrid, and Servicio de Oncología, Hospital Universitario de Elche, Alicante, Spain
- Stephen J. Chanock
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, Department of Health and Human Services, Bethesda, Maryland, United States of America
- Núria Malats
- Genetic and Molecular Epidemiology Group, Spanish National Cancer Research Centre (CNIO), Madrid, Spain
- * E-mail: (NM); (KVS)
- Kristel Van Steen
- Systems and Modeling Unit–BIO3, Montefiore Institute, Liège, Belgium
- Systems Biology and Chemical Biology, GIGA-R, Liège, Belgium
- * E-mail: (NM); (KVS)
|
40
|
Qiu C, Gelaye B, Denis M, Tadesse MG, Luque Fernandez MA, Enquobahrie DA, Ananth CV, Sanchez SE, Williams MA. Circadian clock-related genetic risk scores and risk of placental abruption. Placenta 2015; 36:1480-6. [PMID: 26515929 PMCID: PMC5010362 DOI: 10.1016/j.placenta.2015.10.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 06/16/2015] [Revised: 10/06/2015] [Accepted: 10/11/2015] [Indexed: 10/22/2022]
Abstract
INTRODUCTION The circadian clock plays an important role in several aspects of female reproductive biology. Evidence linking circadian clock-related genes to pregnancy outcomes has been inconsistent. We sought to examine whether variations in single nucleotide polymorphisms (SNPs) of circadian clock genes are associated with placental abruption (PA) risk. METHODS Maternal blood samples were collected from 470 PA cases and 473 controls. Genotyping was performed using the Illumina Cardio-MetaboChip platform. We examined 119 SNPs in 13 candidate genes known to control circadian rhythms (e.g., CRY2, ARNTL, and RORA). Univariate and penalized logistic regression models were fit to estimate odds ratios (ORs), and the combined effect of multiple SNPs on PA risk was estimated using a weighted genetic risk score (wGRS). RESULTS A common SNP in the RORA gene (rs2899663) was associated with a 21% reduced odds of PA (P < 0.05). The odds of PA increased with increasing wGRS (Ptrend < 0.001). The corresponding ORs were 1.00, 1.83, 2.81 and 5.13 across wGRS quartiles. Participants in the highest wGRS quartile had 5.13-fold (95% confidence interval: 3.21-8.21) higher odds of PA compared to those in the lowest quartile. Although the test for interaction was not significant, the odds of PA were substantially elevated for preeclamptic women in the highest wGRS quartile (OR = 14.44, 95% CI: 6.62-31.53) compared to normotensive women in the lowest wGRS quartile. DISCUSSION Genetic variants in circadian rhythm genes may be associated with PA risk. Larger studies are needed to corroborate these findings and to further elucidate the pathogenesis of this important obstetrical complication.
Affiliation(s)
- Chunfang Qiu
- Center for Perinatal Studies, Swedish Medical Center, Seattle, WA, USA.
- Bizu Gelaye
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Marie Denis
- UMR Amélioration Génétique et Adaptation des Plantes méditerranéennes et tropicales (AGAP), CIRAD, Montpellier, France
- Mahlet G Tadesse
- Department of Mathematics and Statistics, Georgetown University, Washington, DC, USA
- Daniel A Enquobahrie
- Center for Perinatal Studies, Swedish Medical Center, Seattle, WA, USA; Department of Epidemiology, School of Public Health, University of Washington, Seattle, WA, USA
- Cande V Ananth
- Department of Obstetrics and Gynecology, College of Physicians and Surgeons, Columbia University Medical Center, New York, NY, USA; Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, NY, USA
- Sixto E Sanchez
- Sección de Post Grado, Facultad de Medicina Humana, Universidad San Martín de Porres, Lima, Peru; A.C. PROESA, Lima, Peru; Department of Obstetrics and Gynecology, San Marcos University, Lima, Peru
- Michelle A Williams
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
|
41
|
Liquet B, Lafaye de Micheaux P, Hejblum BP, Thiébaut R. Group and sparse group partial least square approaches applied in genomics context. Bioinformatics 2015; 32:35-42. [DOI: 10.1093/bioinformatics/btv535] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 03/14/2015] [Accepted: 09/03/2015] [Indexed: 01/07/2023] Open
|
42
|
Fouladi R, Bessonov K, Van Lishout F, Van Steen K. Model-Based Multifactor Dimensionality Reduction for Rare Variant Association Analysis. Hum Hered 2015. [PMID: 26201701 DOI: 10.1159/000381286] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Indexed: 11/19/2022] Open
Abstract
Genome-wide association studies have revealed a vast number of common loci associated with human complex diseases. Still, a large proportion of heritability remains unexplained. The extent to which rare genetic variants (RVs) are able to explain a relevant portion of the genetic heritability for complex traits leaves room for several debates and paves the way to the collection of RV databases and the development of novel analytic tools to analyze these. To date, several statistical methods have been proposed to uncover the association of RVs with complex diseases, but none of them is the clear winner in all possible scenarios of study design and assumed underlying disease model. The latter may involve differences in the distributions of effect sizes, proportions of causal variants, and ratios of protective to deleterious variants at distinct regions throughout the genome. Therefore, there is a need for robust scalable methods with acceptable overall performance in terms of power and type I error under various realistic scenarios. In this paper, we propose a novel RV association analysis strategy, which satisfies several of the desired properties that an RV analysis tool should exhibit.
Affiliation(s)
- Ramouna Fouladi
- Systems and Modeling Unit, Montefiore Institute, and Bioinformatics and Modeling, GIGA-R, University of Liège, Liège, Belgium
|
43
|
Li Y, Nan B, Zhu J. Multivariate sparse group lasso for the multivariate multiple linear regression with an arbitrary group structure. Biometrics 2015; 71:354-63. [PMID: 25732839 DOI: 10.1111/biom.12292] [Citation(s) in RCA: 61] [Impact Index Per Article: 6.8] [Received: 08/01/2013] [Revised: 12/01/2014] [Accepted: 01/01/2015] [Indexed: 11/27/2022]
Abstract
We propose a multivariate sparse group lasso variable selection and estimation method for data with high-dimensional predictors as well as high-dimensional response variables. The method is carried out through a penalized multivariate multiple linear regression model with an arbitrary group structure for the regression coefficient matrix. It suits many biology studies well in detecting associations between multiple traits and multiple predictors, with each trait and each predictor embedded in some biological functional groups such as genes, pathways or brain regions. The method is able to effectively remove unimportant groups as well as unimportant individual coefficients within important groups, particularly for large p small n problems, and is flexible in handling various complex group structures such as overlapping or nested or multilevel hierarchical structures. The method is evaluated through extensive simulations with comparisons to the conventional lasso and group lasso methods, and is applied to an eQTL association study.
Affiliation(s)
- Yanming Li
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, 48109, U.S.A
- Bin Nan
- Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, 48109, U.S.A
- Ji Zhu
- Department of Statistics, University of Michigan, Ann Arbor, Michigan, 48109, U.S.A
|
44
|
Garner C. Confounded by sequencing depth in association studies of rare alleles. Genet Epidemiol 2011; 35:261-8. [PMID: 21328616 DOI: 10.1002/gepi.20574] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.3] [Received: 11/01/2010] [Accepted: 01/12/2011] [Indexed: 11/12/2022]
Abstract
Next-generation DNA sequencing technologies are facilitating large-scale association studies of rare genetic variants. The depth of the sequence read coverage is an important experimental variable in the next-generation technologies and it is a major determinant of the quality of genotype calls generated from sequence data. When case and control samples are sequenced separately or in different proportions across batches, they are unlikely to be matched on sequencing read depth and a differential misclassification of genotypes can result, causing confounding and an increased false-positive rate. Data from Pilot Study 3 of the 1000 Genomes project was used to demonstrate that a difference between the mean sequencing read depth of case and control samples can result in false-positive association for rare and uncommon variants, even when the mean coverage depth exceeds 30× in both groups. The degree of the confounding and inflation in the false-positive rate depended on the extent to which the mean depth was different in the case and control groups. A logistic regression model was used to test for association between case-control status and the cumulative number of alleles in a collapsed set of rare and uncommon variants. Including each individual's mean sequence read depth across the variant sites in the logistic regression model nearly eliminated the confounding effect and the inflated false-positive rate. Furthermore, accounting for the potential error by modeling the probability of the heterozygote genotype calls in the regression analysis had a relatively minor but beneficial effect on the statistical results.
Affiliation(s)
- Chad Garner
- Department of Epidemiology, University of California, Irvine, CA 92697-3905, USA.
|
45
|
Interactions of early adversity with stress-related gene polymorphisms impact regional brain structure in females. Brain Struct Funct 2015; 221:1667-79. [PMID: 25630611 DOI: 10.1007/s00429-015-0996-9] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 05/06/2014] [Accepted: 01/21/2015] [Indexed: 12/17/2022]
Abstract
Early adverse life events (EALs) have been associated with regional thinning of the subgenual cingulate cortex (sgACC), a brain region implicated in the development of disorders of mood and affect, and often comorbid functional pain disorders, such as irritable bowel syndrome (IBS). Regional neuroinflammation related to chronic stress system activation has been suggested as a possible mechanism underlying these neuroplastic changes. However, the interaction of genetic and environmental factors in these changes is poorly understood. The current study aimed to evaluate the interactions of EALs and candidate gene polymorphisms in influencing thickness of the sgACC. A total of 210 female subjects (137 healthy controls [HCs]; 73 with IBS) were genotyped for stress- and inflammation-related gene polymorphisms. The effects of genetic variation, together with EALs and diagnosis, on sgACC thickness were examined, while controlling for race, age, and total brain volume. Compared to HCs, subjects with IBS had significantly reduced sgACC thickness (p = 0.03). Regardless of disease group (IBS vs. HC), thinning of the left sgACC was associated with a significant gene-gene-environment interaction between the IL-1β genotype, the NR3C1 haplotype, and a history of EALs (p = 0.05). Reduced sgACC thickness in women with the minor IL-1β allele was associated with EAL total scores regardless of NR3C1 haplotype status (p = 0.02). In subjects homozygous for the major IL-1β allele, reduced sgACC thickness with increasing levels of EALs was seen only with the less common NR3C1 haplotype (p = 0.02). These findings support an interaction between polymorphisms related to stress and inflammation and early adverse life events in modulating a key region of the emotion arousal circuit.
|
46
|
Matsui H. Sparse regularization for bi-level variable selection. Journal of the Japanese Society of Computational Statistics 2015. [DOI: 10.5183/jjscs.1502001_216] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/11/2022]
|
47
|
Denis M, Enquobahrie DA, Tadesse MG, Gelaye B, Sanchez SE, Salazar M, Ananth CV, Williams MA. Placental genome and maternal-placental genetic interactions: a genome-wide and candidate gene association study of placental abruption. PLoS One 2014; 9:e116346. [PMID: 25549360 PMCID: PMC4280220 DOI: 10.1371/journal.pone.0116346] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Received: 09/18/2014] [Accepted: 12/08/2014] [Indexed: 01/02/2023] Open
Abstract
While available evidence supports the role of genetics in the pathogenesis of placental abruption (PA), PA-related placental genome variations and maternal-placental genetic interactions have not been investigated. Maternal blood and placental samples collected from participants in the Peruvian Abruptio Placentae Epidemiology study were genotyped using Illumina's Cardio-Metabochip platform. We examined 118,782 genome-wide SNPs and 333 SNPs in 32 candidate genes from mitochondrial biogenesis and oxidative phosphorylation pathways in placental DNA from 280 PA cases and 244 controls. We assessed maternal-placental interactions in the candidate gene SNPs and two imprinted regions (IGF2/H19 and C19MC). Univariate and penalized logistic regression models were fit to estimate odds ratios. We examined the combined effect of multiple SNPs on PA risk using weighted genetic risk scores (WGRS) with repeated ten-fold cross-validations. A multinomial model was used to investigate maternal-placental genetic interactions. In placental genome-wide and candidate gene analyses, no SNP was significant after false discovery rate correction. The top genome-wide association study (GWAS) hits were rs544201, rs1484464 (CTNNA2), rs4149570 (TNFRSF1A) and rs13055470 (ZNRF3) (p-values: 1.11e-05 to 3.54e-05). The top 200 SNPs of the GWAS overrepresented genes involved in cell cycle, growth and proliferation. The top candidate gene hits were rs16949118 (COX10) and rs7609948 (THRB) (p-values: 6.00e-03 and 8.19e-03). Participants in the highest quartile of WGRS based on cross-validations using SNPs selected from the GWAS and candidate gene analyses had an 8.40-fold (95% CI: 5.8-12.56) and a 4.46-fold (95% CI: 2.94-6.72) higher odds of PA, respectively, compared to participants in the lowest quartile. We found maternal-placental genetic interactions on PA risk for two SNPs in PPARG (chr3:12313450 and chr3:12412978) and maternal imprinting effects for multiple SNPs in the C19MC and IGF2/H19 regions.
Variations in the placental genome and interactions between maternal-placental genetic variations may contribute to PA risk. Larger studies may help advance our understanding of PA pathogenesis.
Affiliation(s)
- Marie Denis
- Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, United States of America; UMR AGAP (Amélioration Génétique et Adaptation des Plantes méditerranéennes et tropicales), CIRAD, Montpellier, France
- Daniel A Enquobahrie
- Center for Perinatal Studies, Swedish Medical Center, Seattle, Washington, United States of America; Department of Epidemiology, School of Public Health, University of Washington, Seattle, Washington, United States of America
- Mahlet G Tadesse
- Department of Mathematics and Statistics, Georgetown University, Washington, D.C., United States of America
- Bizu Gelaye
- Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, United States of America
- Sixto E Sanchez
- Sección de Post Grado, Facultad de Medicina Humana, Universidad San Martín de Porres, Lima, Peru; A.C. PROESA, Lima, Peru
- Manuel Salazar
- Department of Obstetrics and Gynecology, San Marcos University, Lima, Peru
- Cande V Ananth
- Department of Obstetrics and Gynecology, College of Physicians and Surgeons, Columbia University Medical Center, New York, New York, United States of America; Department of Epidemiology, Mailman School of Public Health, Columbia University, New York, New York, United States of America
- Michelle A Williams
- Department of Epidemiology, Harvard School of Public Health, Boston, Massachusetts, United States of America
|
48
|
Ionita-Laza I, Capanu M, De Rubeis S, McCallum K, Buxbaum JD. Identification of rare causal variants in sequence-based studies: methods and applications to VPS13B, a gene involved in Cohen syndrome and autism. PLoS Genet 2014; 10:e1004729. [PMID: 25502226 PMCID: PMC4263785 DOI: 10.1371/journal.pgen.1004729] [Citation(s) in RCA: 40] [Impact Index Per Article: 4.0] [Received: 05/15/2014] [Accepted: 09/02/2014] [Indexed: 11/18/2022] Open
Abstract
Pinpointing the small number of causal variants among the abundant naturally occurring genetic variation is a difficult challenge, but a crucial one for understanding precise molecular mechanisms of disease and follow-up functional studies. We propose and investigate two complementary statistical approaches for identification of rare causal variants in sequencing studies: a backward elimination procedure based on groupwise association tests, and a hierarchical approach that can integrate sequencing data with diverse functional and evolutionary conservation annotations for individual variants. Using simulations, we show that incorporation of multiple bioinformatic predictors of deleteriousness, such as PolyPhen-2, SIFT and GERP++ scores, can improve the power to discover truly causal variants. As proof of principle, we apply the proposed methods to VPS13B, a gene mutated in the rare neurodevelopmental disorder called Cohen syndrome, and recently reported with recessive variants in autism. We identify a small set of promising candidates for causal variants, including two loss-of-function variants and a rare, homozygous probably-damaging variant that could contribute to autism risk.
Affiliation(s)
- Iuliana Ionita-Laza
- Department of Biostatistics, Columbia University, New York, New York, United States of America
- * E-mail:
- Marinela Capanu
- Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
- Silvia De Rubeis
- Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Departments of Psychiatry, Mount Sinai School of Medicine, New York, New York, United States of America
- Kenneth McCallum
- Department of Biostatistics, Columbia University, New York, New York, United States of America
- Joseph D. Buxbaum
- Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Departments of Psychiatry, Mount Sinai School of Medicine, New York, New York, United States of America
- Departments of Genetics and Genomic Sciences, and Neuroscience, and Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
- Mindich Child Health and Development Institute, Mount Sinai School of Medicine, New York, New York, United States of America
|
49
|
Li J, Zhong W, Li R, Wu R. A fast algorithm for detecting gene-gene interactions in genome-wide association studies. Ann Appl Stat 2014; 8:2292-2318. [PMID: 26457126 DOI: 10.1214/14-aoas771] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Indexed: 12/22/2022]
Abstract
With the recent advent of high-throughput genotyping techniques, genetic data for genome-wide association studies (GWAS) have become increasingly available, which entails the development of efficient and effective statistical approaches. Although many such approaches have been developed and used to identify single-nucleotide polymorphisms (SNPs) that are associated with complex traits or diseases, few are able to detect gene-gene interactions among different SNPs. Genetic interactions, also known as epistasis, have been recognized to play a pivotal role in contributing to the genetic variation of phenotypic traits. However, because of an extremely large number of SNP-SNP combinations in GWAS, the model dimensionality can quickly become so overwhelming that no prevailing variable selection methods are capable of handling this problem. In this paper, we present a statistical framework for characterizing main genetic effects and epistatic interactions in a GWAS study. Specifically, we first propose a two-stage sure independence screening (TS-SIS) procedure and generate a pool of candidate SNPs and interactions, which serve as predictors to explain and predict the phenotypes of a complex trait. We also propose a rates adjusted thresholding estimation (RATE) approach to determine the size of the reduced model selected by an independence screening. Regularization regression methods, such as LASSO or SCAD, are then applied to further identify important genetic effects. Simulation studies show that the TS-SIS procedure is computationally efficient and has an outstanding finite sample performance in selecting potential SNPs as well as gene-gene interactions. We apply the proposed framework to analyze an ultrahigh-dimensional GWAS data set from the Framingham Heart Study, and select 23 active SNPs and 24 active epistatic interactions for the body mass index variation. It shows the capability of our procedure to resolve the complexity of genetic control.
Affiliation(s)
- Jiahan Li
- Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, Indiana 46556, USA
- Wei Zhong
- Institute for Studies in Economics, Department of Statistics, School of Economics, Fujian Key Laboratory of Statistical Science, Xiamen University, Xiamen, Fujian 361005, China
- Runze Li
- The Methodology Center, Department of Statistics, Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Rongling Wu
- Center for Statistical Genetics, Pennsylvania State University, Hershey, Pennsylvania 17033, USA
|
50
|
Sabourin J, Nobel AB, Valdar W. Fine-mapping additive and dominant SNP effects using group-LASSO and fractional resample model averaging. Genet Epidemiol 2014; 39:77-88. [PMID: 25417853 DOI: 10.1002/gepi.21869] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 05/20/2014] [Revised: 09/25/2014] [Accepted: 09/30/2014] [Indexed: 12/28/2022]
Abstract
Genomewide association studies (GWAS) sometimes identify loci at which both the number and identities of the underlying causal variants are ambiguous. In such cases, statistical methods that model effects of multiple single-nucleotide polymorphisms (SNPs) simultaneously can help disentangle the observed patterns of association and provide information about how those SNPs could be prioritized for follow-up studies. Current multi-SNP methods, however, tend to assume that SNP effects are well captured by additive genetics; yet when genetic dominance is present, this assumption translates to reduced power and faulty prioritizations. We describe a statistical procedure for prioritizing SNPs at GWAS loci that efficiently models both additive and dominance effects. Our method, LLARRMA-dawg, combines a group LASSO procedure for sparse modeling of multiple SNP effects with a resampling procedure based on fractional observation weights. It estimates for each SNP the robustness of association with the phenotype both to sampling variation and to competing explanations from other SNPs. In producing an SNP prioritization that best identifies underlying true signals, we show the following: our method easily outperforms a single-marker analysis; when additive-only signals are present, our joint model for additive and dominance is equivalent to or only slightly less powerful than modeling additive-only effects; and when dominance signals are present, even in combination with substantial additive effects, our joint model is unequivocally more powerful than a model assuming additivity. We also describe how performance can be improved through calibrated randomized penalization, and discuss how dominance in ungenotyped SNPs can be incorporated through either heterozygote dosage or multiple imputation.
Affiliation(s)
- Jeremy Sabourin
- Department of Genetics, University of North Carolina at Chapel Hill, North Carolina, United States of America; Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, North Carolina, United States of America
|