1
Gaye A, Diongue AK, Komen LN, Diallo A, Sylla SN, Diarra M, Talla C, Loucoubar C. High-dimensional supervised classification in a context of non-independence of observations to identify the determining SNPs in a phenotype. Infect Dis Model 2023; 8:1079-1087. [PMID: 37727806 PMCID: PMC10505671 DOI: 10.1016/j.idm.2023.09.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2023] [Revised: 08/29/2023] [Accepted: 09/03/2023] [Indexed: 09/21/2023]
Abstract
This work addresses the problem of supervised classification for highly correlated, high-dimensional data describing non-independent observations, with the aim of identifying SNPs related to a phenotype. We use a general penalized linear mixed model with a single random effect that performs simultaneous SNP selection and population structure adjustment in high-dimensional prediction models. Specifically, the model simultaneously selects variables and estimates their effects, taking into account correlations between individuals. Single nucleotide polymorphisms (SNPs) are a type of genetic variation; each SNP represents a difference in a single DNA building block, namely a nucleotide. Previous research has shown that SNPs can be used to identify the correct source population of an individual and can act in isolation or simultaneously to impact a phenotype. In this regard, the study of the contribution of genetics to infectious disease phenotypes is of great importance. In this study, we used uncorrelated variables, obtained from blocks of correlated variables constructed in a previous work, to describe the most related observations of the dataset. The model was trained with 90% of the observations and tested with the remaining 10%. The best model obtained with the generalized information criterion (GIC) identified the SNP rs2493311, located on chromosome 1 in the gene PRDM16 (PR/SET domain 16), as the most decisive factor in malaria attacks.
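For orientation, the kind of model this abstract describes can be written down as follows. This is only a sketch of one common formulation of a penalized linear mixed model with a single random effect; the symbols (the kinship matrix \Phi, the variance ratio \eta, the penalty weight a_n, and so on) are assumptions introduced here and do not appear in the abstract.

```latex
% Assumed notation: Y is the phenotype vector for n individuals, X the n x p SNP matrix,
% \Phi a known kinship (relatedness) matrix, \eta in [0,1) the share of variance due to relatedness.
\begin{align}
  Y &= X\beta + P + \varepsilon,
     \qquad P \sim \mathcal{N}\!\left(0,\ \eta\sigma^2\Phi\right),
     \qquad \varepsilon \sim \mathcal{N}\!\left(0,\ (1-\eta)\sigma^2 I_n\right) \\
  (\hat\beta_\lambda, \hat\eta_\lambda, \hat\sigma^2_\lambda)
    &= \arg\min_{\beta,\eta,\sigma^2}\; -\ell(\beta,\eta,\sigma^2) + \lambda \sum_{j=1}^{p} |\beta_j| \\
  \mathrm{GIC}_\lambda
    &= -2\,\ell(\hat\beta_\lambda, \hat\eta_\lambda, \hat\sigma^2_\lambda) + a_n\,\widehat{\mathrm{df}}_\lambda
\end{align}
% \ell is the Gaussian log-likelihood implied by the first line;
% \widehat{df}_\lambda counts the estimated nonzero parameters; a_n is a penalty weight chosen by the analyst.
```

Under these assumptions, the tuning parameter \lambda would be chosen to minimize GIC on the training observations, and the selected SNPs are those with nonzero \hat\beta_j.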
Affiliation(s)
- Aboubacry Gaye
  - Laboratory for Studies and Research in Statistics and Development, Gaston Berger University of Saint Louis, Senegal
  - Epidemiology, Clinical Research and Data Science Unit, Institute Pasteur de Dakar, 220, Dakar, Senegal
- Abdou Ka Diongue
  - Laboratory for Studies and Research in Statistics and Development, Gaston Berger University of Saint Louis, Senegal
- Amadou Diallo
  - Epidemiology, Clinical Research and Data Science Unit, Institute Pasteur de Dakar, 220, Dakar, Senegal
- Seydou Nourou Sylla
  - Information and Communication Technologies for Development, Alioune Diop University of Bambey, Senegal
- Maryam Diarra
  - Epidemiology, Clinical Research and Data Science Unit, Institute Pasteur de Dakar, 220, Dakar, Senegal
- Cheikh Talla
  - Epidemiology, Clinical Research and Data Science Unit, Institute Pasteur de Dakar, 220, Dakar, Senegal
- Cheikh Loucoubar
  - Epidemiology, Clinical Research and Data Science Unit, Institute Pasteur de Dakar, 220, Dakar, Senegal
2
Bhatnagar SR, Yang Y, Lu T, Schurr E, Loredo-Osti JC, Forest M, Oualkacha K, Greenwood CMT. Simultaneous SNP selection and adjustment for population structure in high dimensional prediction models. PLoS Genet 2020; 16:e1008766. [PMID: 32365090 PMCID: PMC7224575 DOI: 10.1371/journal.pgen.1008766] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Received: 07/12/2019] [Revised: 05/14/2020] [Accepted: 04/08/2020] [Indexed: 12/23/2022]
Abstract
Complex traits are known to be influenced by a combination of environmental factors and rare and common genetic variants. However, detection of such multivariate associations can be compromised by low statistical power and confounding by population structure. Linear mixed effects models (LMMs) can account for correlations due to relatedness but have not been applicable in high-dimensional (HD) settings where the number of fixed effect predictors greatly exceeds the number of samples. False positives or false negatives can result from two-stage approaches, where the residuals estimated from a null model adjusted for the subjects' relationship structure are subsequently used as the response in a standard penalized regression model. To overcome these challenges, we develop a general penalized LMM with a single random effect, called ggmix, for simultaneous SNP selection and adjustment for population structure in high-dimensional prediction models. We develop a blockwise coordinate descent algorithm with automatic tuning parameter selection which is highly scalable, computationally efficient and has theoretical guarantees of convergence. Through simulations and three real data examples, we show that ggmix leads to more parsimonious models with better prediction accuracy than the two-stage approach or principal component adjustment. Our method performs well even in the presence of highly correlated markers, and when the causal SNPs are included in the kinship matrix. ggmix can be used to construct polygenic risk scores and select instrumental variables in Mendelian randomization studies. Our algorithms are implemented in an R package available on CRAN (https://cran.r-project.org/package=ggmix).
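To make the idea of a penalized LMM with a single random effect more concrete, here is a minimal Python sketch. It is not the ggmix package or its API. It assumes that the covariance of the response is proportional to eta * kinship + (1 - eta) * I and, as a simplification, holds the variance ratio `eta` fixed instead of estimating it. Rotating the data by the eigenvectors of the kinship matrix turns the problem into a weighted lasso, solved here by plain coordinate descent. The function names and the simulated data are hypothetical.

```python
import numpy as np


def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    return np.sign(z) * max(abs(z) - t, 0.0)


def rotated_weighted_lasso(X, y, kinship, eta, lam, n_iter=1000, tol=1e-10):
    """Lasso for y = X beta + P + e, assuming Cov(y) is proportional to
    eta * kinship + (1 - eta) * I with eta in [0, 1) held fixed (assumption)."""
    n, p = X.shape
    d, U = np.linalg.eigh(kinship)        # kinship = U diag(d) U^T
    Xr, yr = U.T @ X, U.T @ y             # rotated data have diagonal covariance
    w = 1.0 / (1.0 + eta * (d - 1.0))     # inverse variances in the rotated space
    scale = (w[:, None] * Xr ** 2).sum(axis=0) / n
    beta = np.zeros(p)
    resid = yr.copy()                     # residual for the current beta
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if scale[j] == 0.0:
                continue
            # weighted correlation of column j with its partial residual
            rho = (w * Xr[:, j] * resid).sum() / n + scale[j] * beta[j]
            new = soft_threshold(rho, lam) / scale[j]
            delta = new - beta[j]
            if delta != 0.0:
                resid -= delta * Xr[:, j]
                beta[j] = new
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return beta


# Simulated example (hypothetical data): 20 families of 3 related individuals.
rng = np.random.default_rng(0)
kinship = np.kron(np.eye(20), 0.5 * np.eye(3) + 0.5 * np.ones((3, 3)))
X = rng.standard_normal((60, 10))
y = 2.0 * X[:, 0] + rng.standard_normal(60)
beta_hat = rotated_weighted_lasso(X, y, kinship, eta=0.5, lam=0.1)
print(np.flatnonzero(beta_hat))           # indices of the selected predictors
```

With `lam=0.0` the sketch reduces to generalized least squares under the assumed covariance, which is a convenient sanity check. A full method of the kind described in the abstract would also update the variance parameters in alternating blocks and select `lam` automatically.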
Affiliation(s)
- Sahir R. Bhatnagar
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montréal, Québec, Canada
  - Department of Diagnostic Radiology, McGill University, Montréal, Québec, Canada
- Yi Yang
  - Department of Mathematics and Statistics, McGill University, Montréal, Québec, Canada
- Tianyuan Lu
  - Quantitative Life Sciences, McGill University, Montréal, Québec, Canada
  - Lady Davis Institute, Jewish General Hospital, Montréal, Québec, Canada
- Erwin Schurr
  - Department of Medicine, McGill University, Montréal, Québec, Canada
- JC Loredo-Osti
  - Department of Mathematics and Statistics, Memorial University, St. John’s, Newfoundland and Labrador, Canada
- Marie Forest
  - École de Technologie Supérieure, Montréal, Québec, Canada
- Karim Oualkacha
  - Département de Mathématiques, Université du Québec à Montréal, Montréal, Québec, Canada
- Celia M. T. Greenwood
  - Department of Epidemiology, Biostatistics and Occupational Health, McGill University, Montréal, Québec, Canada
  - Quantitative Life Sciences, McGill University, Montréal, Québec, Canada
  - Lady Davis Institute, Jewish General Hospital, Montréal, Québec, Canada
  - Gerald Bronfman Department of Oncology, McGill University, Montréal, Québec, Canada
  - Department of Human Genetics, McGill University, Montréal, Québec, Canada
3
Papachristou C, Ober C, Abney M. A LASSO penalized regression approach for genome-wide association analyses using related individuals: application to the Genetic Analysis Workshop 19 simulated data. BMC Proc 2016; 10:221-226. [PMID: 27980640 PMCID: PMC5133525 DOI: 10.1186/s12919-016-0034-9] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Indexed: 11/10/2022]
Abstract
We propose a novel LASSO (least absolute shrinkage and selection operator) penalized regression method for analyzing samples consisting of (potentially) related individuals. Developed in the context of linear mixed models, our method models the relatedness of individuals in the sample through a random effect whose covariance structure is a linear function of known matrices, the elements of which are combinations of the condensed coefficients of identity between the individuals in the sample. We apply our method to the simulated family data provided by the 19th Genetic Analysis Workshop in an effort to identify loci regulating the simulated trait of systolic blood pressure. The analyses were performed with full knowledge of the simulation model. Our findings demonstrate that we can significantly reduce the rate of false positive signals by incorporating the relatedness of the study participants.
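The covariance structure described in this abstract can be sketched in equation form. The notation below is an assumption made for illustration (the abstract gives no symbols), and the non-inbred special case in the comment is the standard textbook relation, not necessarily the parametrization used by the authors.

```latex
% Assumed notation: y is the trait vector, X the genotype/covariate matrix,
% g the random effect capturing relatedness, M_k known n x n matrices, \sigma_k^2 unknown variance components.
\begin{align}
  y &= X\beta + g + e, \qquad e \sim \mathcal{N}\!\left(0,\ \sigma_e^2 I_n\right) \\
  \operatorname{Cov}(g) &= \sum_{k=1}^{K} \sigma_k^2\, M_k \\
  \hat\beta_\lambda &= \arg\min_{\beta}\; -\ell\!\left(\beta, \sigma_1^2, \dots, \sigma_K^2, \sigma_e^2\right) + \lambda \sum_{j} |\beta_j|
\end{align}
% Example for non-inbred individuals i \neq j: the additive matrix has entries
% 2\phi_{ij} = \Delta_{7,ij} + \tfrac{1}{2}\Delta_{8,ij}, and the dominance matrix has entries \Delta_{7,ij},
% where \Delta_7 and \Delta_8 are condensed coefficients of identity.
```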
Affiliation(s)
- Charalampos Papachristou
  - Department of Mathematics, Physics, and Statistics, University of the Sciences, 600 S. 43rd Street, Philadelphia, PA 19104 USA
  - Department of Mathematics, Rowan University, 201 Mullica Hill Road, Glassboro, NJ 08028 USA
- Carole Ober
  - Department of Human Genetics, University of Chicago, 920 E. 58th Street, CLSC 4th floor, Chicago, IL 60637 USA
- Mark Abney
  - Department of Human Genetics, University of Chicago, 920 E. 58th Street, CLSC 4th floor, Chicago, IL 60637 USA
4
Li D, Zhou J, Thomas DC, Fardo DW. Complex pedigrees in the sequencing era: to track transmissions or decorrelate? Genet Epidemiol 2014; 38 Suppl 1:S29-36. [PMID: 25112185 DOI: 10.1002/gepi.21822] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/06/2022]
Abstract
Next-generation sequencing (NGS) studies are becoming commonplace, and the NGS field is continuing to develop rapidly. Analytic methods aimed at testing for the various roles that genetic susceptibility plays in disease are also rapidly being developed and optimized. Studies that incorporate large, complex pedigrees are of particular importance because they provide detailed information about inheritance patterns and can be analyzed in a variety of complementary ways. The nine contributions from our Genetic Analysis Workshop 18 working group on family-based tests of association for rare variants using simulated data examined analytic methods for testing genetic association using whole-genome sequencing data from 20 large pedigrees with 200 phenotype simulation replicates. What distinguishes the approaches explored is how the complexities of analyzing familial genetic data were handled. Here, we explore the methods that either harness inheritance patterns and transmission information or attempt to adjust for the correlation between family members in order to utilize computationally and conceptually simpler statistical testing procedures. Although directly comparing these two classes of approaches across contributions is difficult, we note that the two classes balance robustness to population stratification and computational complexity (the transmission-based approaches) with simplicity and increased power, assuming no population stratification or proper adjustment for it (decorrelation approaches).
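The "decorrelation" class of approaches mentioned above can be illustrated with a short Python sketch. It assumes the trait covariance is var_genetic * 2 * kinship + var_residual * I with both variance components already known; in practice they would have to be estimated. Whitening the data with the Cholesky factor of that covariance lets an ordinary least-squares fit stand in for the correlated-data analysis. The function names, the variance values, and the toy nuclear family are hypothetical.

```python
import numpy as np


def family_covariance(kinship, var_genetic, var_residual):
    """Trait covariance under an additive polygenic model (assumed form)."""
    n = kinship.shape[0]
    return var_genetic * 2.0 * kinship + var_residual * np.eye(n)


def decorrelate(y, X, cov):
    """Whiten y and X with the Cholesky factor L of cov (cov = L @ L.T),
    so that the transformed observations are uncorrelated."""
    L = np.linalg.cholesky(cov)
    return np.linalg.solve(L, y), np.linalg.solve(L, X)


# Toy nuclear family: two unrelated parents followed by two full siblings.
kinship = np.array([
    [0.50, 0.00, 0.25, 0.25],
    [0.00, 0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50, 0.25],
    [0.25, 0.25, 0.25, 0.50],
])
genotype = np.array([0.0, 2.0, 1.0, 1.0])         # minor-allele counts
X = np.column_stack([np.ones(4), genotype])       # intercept + one variant
y = np.array([1.2, 3.1, 2.0, 2.4])                # made-up trait values

cov = family_covariance(kinship, var_genetic=0.6, var_residual=0.4)
y_w, X_w = decorrelate(y, X, cov)

# Ordinary least squares on the whitened data.
beta_hat = np.linalg.lstsq(X_w, y_w, rcond=None)[0]
print(beta_hat)
```

Ordinary least squares on the whitened data coincides with generalized least squares on the original data, which is why simpler testing procedures become usable once the family correlation has been removed.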
Affiliation(s)
- Dalin Li
  - Medical Genetics Institute, Cedars-Sinai Medical Center, Los Angeles, California, United States of America; David Geffen School of Medicine, University of California Los Angeles, Los Angeles, California, United States of America