1
Boutry S, Helaers R, Lenaerts T, Vikkula M. Excalibur: A new ensemble method based on an optimal combination of aggregation tests for rare-variant association testing for sequencing data. PLoS Comput Biol 2023; 19:e1011488. [PMID: 37708232 PMCID: PMC10522036 DOI: 10.1371/journal.pcbi.1011488] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Revised: 09/26/2023] [Accepted: 09/04/2023] [Indexed: 09/16/2023]
Abstract
The development of high-throughput next-generation sequencing technologies and large-scale genetic association studies produced numerous advances in the biostatistics field. Various aggregation tests, i.e. statistical methods that analyze associations of a trait with multiple markers within a genomic region, have produced a variety of novel discoveries. Notwithstanding their usefulness, there is no single test that fits all needs, each suffering from specific drawbacks. Selecting the right aggregation test, while considering an unknown underlying genetic model of the disease, remains an important challenge. Here we propose a new ensemble method, called Excalibur, based on an optimal combination of 36 aggregation tests created after an in-depth study of the limitations of each test and their impact on the quality of the results. Our findings demonstrate the ability of our method to control type I error and illustrate that it offers the best average power across all scenarios. The proposed method can handle a wide range of association models and so allows for novel advances in whole-exome/genome sequencing association studies, providing researchers with an optimal aggregation analysis for the genetic regions of interest.
Affiliation(s)
- Simon Boutry
  - Human Molecular Genetics, de Duve Institute, University of Louvain, Brussels, Belgium
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Brussels, Belgium
- Raphaël Helaers
  - Human Molecular Genetics, de Duve Institute, University of Louvain, Brussels, Belgium
- Tom Lenaerts
  - Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Brussels, Belgium
  - Machine Learning Group, Université Libre de Bruxelles, Brussels, Belgium
  - Artificial Intelligence laboratory, Vrije Universiteit Brussel, Brussels, Belgium
- Miikka Vikkula
  - Human Molecular Genetics, de Duve Institute, University of Louvain, Brussels, Belgium
  - WELBIO department, WEL Research Institute, Wavre, Belgium
2
Wu Y, Xiao N, Chen Y, Yu L, Pan C, Li Y, Zhang X, Huang N, Ji H, Dai Z, Chen X, Li A. Comprehensive evaluation of resistance effects of pyramiding lines with different broad-spectrum resistance genes against Magnaporthe oryzae in rice (Oryza sativa L.). Rice (New York, N.Y.) 2019; 12:11. [PMID: 30825053 PMCID: PMC6397272 DOI: 10.1186/s12284-019-0264-3] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 09/10/2018] [Accepted: 01/17/2019] [Indexed: 05/13/2023]
Abstract
BACKGROUND Broad-spectrum resistance gene pyramiding helps the development of varieties with broad-spectrum and durable resistance to M. oryzae. However, detailed information about how these different sources of broad-spectrum resistance genes act together, or which combinations are best for achieving broad-spectrum and durable resistance, is limited. RESULTS Here a set of fifteen different polygene pyramiding lines (PPLs) was constructed using marker-assisted selection (MAS). Using artificial inoculation assays at the seedling and heading stages, combined with natural induction identification under multiple field environments, we systematically evaluated the resistance effects of different alleles of the Piz locus (Pigm, Pi40, Pi9, Pi2 and Piz) combined with Pi1, Pi33 and Pi54, respectively, and the interaction effects between different R genes. The results showed that the seedling blast and panicle blast resistance levels of PPLs were significantly higher than those of monogenic lines. The main reason was that most of the gene combinations produced transgressive heterosis, and the transgressive heterosis for panicle blast resistance produced by most of the PPLs was higher than that for seedling blast resistance. Pyramiding different broad-spectrum R genes produced different interaction effects; among them, the overlapping effect (OE) between R genes could significantly improve the seedling blast resistance level of PPLs, while the panicle blast resistance of PPLs was remarkably correlated with OE and the complementary effect (CE). In addition, we found that the gene combinations Pigm/Pi1, Pigm/Pi54 and Pigm/Pi33 displayed broad-spectrum resistance in artificial inoculation at the seedling and heading stages, and displayed stable broad-spectrum resistance in different disease nurseries. Evaluation of agronomic traits also showed that PPLs with these three gene combinations were on par with the recurrent parent. These lines therefore provide an elite gene combination model and germplasm for rice blast resistance breeding programs. CONCLUSIONS The development of PPLs and the interaction effect analysis in this study provide a valuable theoretical foundation and innovative resources for breeding varieties with broad-spectrum and durable resistance.
Affiliation(s)
- Yunyu Wu
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225009, China
  - Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding, Yangzhou University, Yangzhou, 225009, China
- Ning Xiao
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225009, China
  - Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding, Yangzhou University, Yangzhou, 225009, China
- Yu Chen
  - Colleges of Horticulture and Plant Protection, Yangzhou University, Yangzhou, 225009, China
- Ling Yu
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225009, China
  - Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding, Yangzhou University, Yangzhou, 225009, China
- Cunhong Pan
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225009, China
  - Jiangsu Collaborative Innovation Center for Modern Crop Production, Nanjing, 210095, China
- Yuhong Li
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225009, China
  - Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding, Yangzhou University, Yangzhou, 225009, China
- Xiaoxiang Zhang
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225009, China
  - Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding, Yangzhou University, Yangzhou, 225009, China
- Niansheng Huang
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225009, China
  - Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding, Yangzhou University, Yangzhou, 225009, China
- Hongjuan Ji
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225009, China
  - Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding, Yangzhou University, Yangzhou, 225009, China
- Zhengyuan Dai
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225009, China
  - Jiangsu Collaborative Innovation Center for Modern Crop Production, Nanjing, 210095, China
- Xijun Chen
  - Colleges of Horticulture and Plant Protection, Yangzhou University, Yangzhou, 225009, China
- Aihong Li
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225009, China
  - Jiangsu Collaborative Innovation Center for Modern Crop Production, Nanjing, 210095, China
  - Jiangsu Key Laboratory of Crop Genomics and Molecular Breeding, Yangzhou University, Yangzhou, 225009, China
3
Karunarathna CB, Graham J. Using Gene Genealogies to Localize Rare Variants Associated with Complex Traits in Diploid Populations. Hum Hered 2018; 83:30-39. [PMID: 29763929 DOI: 10.1159/000486854] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 08/18/2017] [Accepted: 01/16/2018] [Indexed: 11/19/2022]
Abstract
BACKGROUND AND AIMS Many methods can detect trait association with causal variants in candidate genomic regions; however, a comparison of their ability to localize causal variants is lacking. We extend a previous study of the detection abilities of these methods to a comparison of their localization abilities. METHODS Through coalescent simulation, we compare several popular association methods. Cases and controls are sampled from a diploid population to mimic human studies. As benchmarks for comparison, we include two methods that cluster phenotypes on the true genealogical trees: a naive Mantel test considered previously in haploid populations and an extension that takes into account whether case haplotypes carry a causal variant. We first work through a simulated dataset to illustrate the methods. We then perform a simulation study to score the localization and detection properties. RESULTS In our simulations, the association signal was localized least precisely by the naive Mantel test and most precisely by its extension. Most other approaches had intermediate performance similar to the single-variant Fisher exact test. CONCLUSIONS Our results confirm earlier findings in haploid populations about potential gains in performance from genealogy-based approaches. They also highlight differences between haploid and diploid populations when localizing and detecting causal variants.
4
Konigorski S, Yilmaz YE, Pischon T. Comparison of single-marker and multi-marker tests in rare variant association studies of quantitative traits. PLoS One 2017; 12:e0178504. [PMID: 28562689 PMCID: PMC5451057 DOI: 10.1371/journal.pone.0178504] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 01/11/2017] [Accepted: 05/15/2017] [Indexed: 11/19/2022]
Abstract
In genetic association studies of rare variants, low statistical power and potential violations of established estimator properties are among the main challenges of association tests. Multi-marker tests (MMTs) have been proposed to target these challenges, but any comparison with single-marker tests (SMTs) has to consider that their aim is to identify causal genomic regions instead of variants. Valid power comparisons have been performed for the analysis of binary traits indicating that MMTs have higher power, but there is a lack of conclusive studies for quantitative traits. The aim of our study was therefore to fairly compare SMTs and MMTs in their empirical power to identify the same causal loci associated with a quantitative trait. The results of extensive simulation studies indicate that previous results for binary traits cannot be generalized. First, we show that for the analysis of quantitative traits, conventional estimation methods and test statistics of single-marker approaches have valid properties yielding association tests with valid type I error, even when investigating singletons or doubletons. Furthermore, SMTs lead to more powerful association tests for identifying causal genes than MMTs when the effect sizes of causal variants are large, and less powerful tests when causal variants have small effect sizes. For moderate effect sizes, whether SMTs or MMTs have higher power depends on the sample size and percentage of causal SNVs. For a more complete picture, we also compare the power in studies of quantitative and binary traits, and the power to identify causal genes with the power to identify causal rare variants. In a genetic association analysis of systolic blood pressure in the Genetic Analysis Workshop 19 data, SMTs yielded smaller p-values compared to MMTs for most of the investigated blood pressure genes, and were least influenced by the definition of gene regions.
Affiliation(s)
- Stefan Konigorski
  - Molecular Epidemiology Research Group, Max Delbrück Center (MDC) for Molecular Medicine in the Helmholtz Association, Berlin, Germany
- Yildiz E. Yilmaz
  - Department of Mathematics and Statistics, Memorial University of Newfoundland, St. John’s, Newfoundland and Labrador, Canada
  - Discipline of Genetics, Faculty of Medicine, Memorial University of Newfoundland, St. John’s, Newfoundland and Labrador, Canada
  - Discipline of Medicine, Faculty of Medicine, Memorial University of Newfoundland, St. John’s, Newfoundland and Labrador, Canada
- Tobias Pischon
  - Molecular Epidemiology Research Group, Max Delbrück Center (MDC) for Molecular Medicine in the Helmholtz Association, Berlin, Germany
  - Charité Universitätsmedizin Berlin, Berlin, Germany
  - DZHK (German Center for Cardiovascular Research), Berlin, Germany
5
Abstract
Background Recent advances in next-generation sequencing technologies have made it possible to generate large amounts of sequence data with rare variants in a cost-effective way. Yet, the statistical aspect of testing disease association of rare variants is quite challenging, as the typical assumptions fail to hold owing to low minor allele frequency (<0.5% or 1%). Methods I present a Bayesian variable selection approach to detect associations with both rare and common genetic variants for quantitative traits simultaneously. In my model, I frame the problem of identifying disease-associated variants as a problem of variable selection in a sparse space, that is, how best to model the relationship between phenotypes and a set of genetic variants. By constructing a risk index score for a group of rare variants, my method can effectively consider all variants in a multivariate model. I also use a within-chain permutation to generate the empirical thresholds to detect true-positive variants. Results I apply my method to study the association between increases in baseline systolic and diastolic blood pressure (SBP and DBP, respectively) and genetic variants in the data from Genetic Analysis Workshop 19 unrelated samples. I identify several rare and common variants in the gene MAP4 that are potentially associated with SBP and DBP. Conclusions The application shows that my method is powerful in identifying disease-associated variants even when they are extremely rare.
6
Dumancas GG, Ramasahayam S, Bello G, Hughes J, Kramer R. Chemometric regression techniques as emerging, powerful tools in genetic association studies. Trends Analyt Chem 2015. [DOI: 10.1016/j.trac.2015.05.007] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Indexed: 12/23/2022]
7
Wu Y, Xiao N, Yu L, Pan C, Li Y, Zhang X, Liu G, Dai Z, Pan X, Li A. Combination Patterns of Major R Genes Determine the Level of Resistance to the M. oryzae in Rice (Oryza sativa L.). PLoS One 2015; 10:e0126130. [PMID: 26030358 PMCID: PMC4452627 DOI: 10.1371/journal.pone.0126130] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.9] [Received: 01/21/2015] [Accepted: 03/29/2015] [Indexed: 11/18/2022]
Abstract
Rice blast caused by Magnaporthe oryzae is the most devastating disease of rice and poses a serious threat to world food security. In this study, the distribution and effectiveness of 18 R genes in 277 accessions were investigated based on pathogenicity assays and molecular markers. The results showed that most of the accessions exhibited some degree of resistance (resistance frequency, RF >50%). Accordingly, most of the accessions were observed to harbor two or more R genes, and the number of R genes harbored in accessions was significantly positively correlated with RF. Some R genes were demonstrated to be specifically distributed in the genomes of rice sub-species, such as Pigm, Pi9, Pi5 and Pi1, which were only detected in indica-type accessions, and Pik and Piz, which were just harbored in japonica-type accessions. By analyzing the relationship between R genes and RF using a multiple stepwise regression model, the R genes Pid3, Pi5, Pi9, Pi54, Pigm and Pit were found to show the main effects against M. oryzae in indica-type accessions, while Pita, Pb1, Pik, Pizt and Pia were indicated to exhibit the main effects against M. oryzae in japonica-type accessions. Principal component analysis (PCA) and cluster analysis revealed that combination patterns of major R genes were the main factors determining the resistance of rice varieties to M. oryzae, such as 'Pi9+Pi54', 'Pid3+Pigm', 'Pi5+Pid3+Pigm', 'Pi5+Pi54+Pid3+Pigm', 'Pi5+Pid3' and 'Pi5+Pit+Pid3' in indica-type accessions and 'Pik+Pib', 'Pik+Pita', 'Pik+Pb1', 'Pizt+Pia' and 'Pizt+Pita' in japonica-type accessions, which were able to confer effective resistance against M. oryzae. The above results provide good theoretical support for the rational utilization of combinations of major R genes in developing rice cultivars with broad-spectrum resistance.
Affiliation(s)
- Yunyu Wu
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225007, P.R. China
  - Key Laboratory of Plant Functional Genomics, Ministry of Education, Yangzhou University, Yangzhou, 225009, P.R. China
- Ning Xiao
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225007, P.R. China
- Ling Yu
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225007, P.R. China
- Cunhong Pan
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225007, P.R. China
- Yuhong Li
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225007, P.R. China
- Xiaoxiang Zhang
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225007, P.R. China
- Guangqing Liu
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225007, P.R. China
- Zhengyuan Dai
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225007, P.R. China
- Xuebiao Pan
  - Key Laboratory of Plant Functional Genomics, Ministry of Education, Yangzhou University, Yangzhou, 225009, P.R. China
  - * E-mail: (XBP); (AHL)
- Aihong Li
  - Lixiahe Agricultural Research Institute of Jiangsu Province, Yangzhou, 225007, P.R. China
  - * E-mail: (XBP); (AHL)
8
Ionita-Laza I, Capanu M, De Rubeis S, McCallum K, Buxbaum JD. Identification of rare causal variants in sequence-based studies: methods and applications to VPS13B, a gene involved in Cohen syndrome and autism. PLoS Genet 2014; 10:e1004729. [PMID: 25502226 PMCID: PMC4263785 DOI: 10.1371/journal.pgen.1004729] [Citation(s) in RCA: 40] [Impact Index Per Article: 3.6] [Received: 05/15/2014] [Accepted: 09/02/2014] [Indexed: 11/18/2022]
Abstract
Pinpointing the small number of causal variants among the abundant naturally occurring genetic variation is a difficult challenge, but a crucial one for understanding precise molecular mechanisms of disease and follow-up functional studies. We propose and investigate two complementary statistical approaches for identification of rare causal variants in sequencing studies: a backward elimination procedure based on groupwise association tests, and a hierarchical approach that can integrate sequencing data with diverse functional and evolutionary conservation annotations for individual variants. Using simulations, we show that incorporation of multiple bioinformatic predictors of deleteriousness, such as PolyPhen-2, SIFT and GERP++ scores, can improve the power to discover truly causal variants. As proof of principle, we apply the proposed methods to VPS13B, a gene mutated in the rare neurodevelopmental disorder called Cohen syndrome, and recently reported with recessive variants in autism. We identify a small set of promising candidates for causal variants, including two loss-of-function variants and a rare, homozygous probably-damaging variant that could contribute to autism risk.
Affiliation(s)
- Iuliana Ionita-Laza
  - Department of Biostatistics, Columbia University, New York, New York, United States of America
  - * E-mail:
- Marinela Capanu
  - Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
- Silvia De Rubeis
  - Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
  - Departments of Psychiatry, Mount Sinai School of Medicine, New York, New York, United States of America
- Kenneth McCallum
  - Department of Biostatistics, Columbia University, New York, New York, United States of America
- Joseph D. Buxbaum
  - Seaver Autism Center for Research and Treatment, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
  - Departments of Psychiatry, Mount Sinai School of Medicine, New York, New York, United States of America
  - Departments of Genetics and Genomic Sciences, and Neuroscience, and Friedman Brain Institute, Icahn School of Medicine at Mount Sinai, New York, New York, United States of America
  - Mindich Child Health and Development Institute, Mount Sinai School of Medicine, New York, New York, United States of America
9
He L, Pitkäniemi J, Sarin AP, Salomaa V, Sillanpää MJ, Ripatti S. Hierarchical Bayesian model for rare variant association analysis integrating genotype uncertainty in human sequence data. Genet Epidemiol 2014; 39:89-100. [PMID: 25395270 DOI: 10.1002/gepi.21871] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 07/09/2014] [Revised: 09/18/2014] [Accepted: 10/03/2014] [Indexed: 11/08/2022]
Abstract
Next-generation sequencing (NGS) has led to the study of rare genetic variants, which possibly explain the missing heritability for complex diseases. Most existing methods for rare variant (RV) association detection do not account for the common presence of sequencing errors in NGS data. The errors can largely affect the power and perturb the accuracy of association tests due to rare observations of minor alleles. We developed a hierarchical Bayesian approach to estimate the association between RVs and complex diseases. Our integrated framework combines the misclassification probability with shrinkage-based Bayesian variable selection. It allows for flexibility in handling neutral and protective RVs with measurement error, and is robust enough for detecting causal RVs with a wide spectrum of minor allele frequency (MAF). Imputation uncertainty and MAF are incorporated into the integrated framework to achieve the optimal statistical power. We demonstrate that sequencing error does significantly affect the findings, and our proposed model can take advantage of it to improve statistical power in both simulated and real data. We further show that our model outperforms existing methods, such as sequence kernel association test (SKAT). Finally, we illustrate the behavior of the proposed method using a Finnish low-density lipoprotein cholesterol study, and show that it identifies an RV known as FH North Karelia in LDLR gene with three carriers in 1,155 individuals, which is missed by both SKAT and Granvil.
Affiliation(s)
- Liang He
  - Department of Public Health, Hjelt Institute, University of Helsinki, Helsinki, Finland
10
Liu X, Beyene J. Gene-based analysis of rare and common variants to determine association with blood pressure. BMC Proc 2014; 8:S46. [PMID: 25519387 PMCID: PMC4143676 DOI: 10.1186/1753-6561-8-s1-s46] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 01/11/2023]
Abstract
Systolic blood pressure and diastolic blood pressure are known risk factors for cardiovascular diseases, and understanding their genetic basis will have important public health implications. For rare variants, it is extremely challenging to make statistical inference with single-marker tests. Therefore, joint analysis of a set of variants has been proposed. In this paper, we applied the recently proposed methods "test for testing the effect of an optimally weighted combination of variants" (TOW) and "variable weight-TOW" to determine genetic regions that are associated with blood pressure. Then the least absolute shrinkage and selection operator, as well as sparse partial least squares methods, were used to identify significant markers within a gene or in intergenic regions. We investigated the effect of rare variants and common variants, and their combined effect.
Affiliation(s)
- Xiaofeng Liu
  - Population Genomic Program, Department of Clinical Epidemiology & Biostatistics, McMaster University, 1280 Main Street West, Hamilton, Ontario L8S4K1, Canada
- Joseph Beyene
  - Population Genomic Program, Department of Clinical Epidemiology & Biostatistics, McMaster University, 1280 Main Street West, Hamilton, Ontario L8S4K1, Canada
11
King CR, Nicolae DL. GWAS to Sequencing: Divergence in Study Design and Analysis. Genes (Basel) 2014; 5:460-76. [PMID: 24879455 PMCID: PMC4094943 DOI: 10.3390/genes5020460] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.3] [Received: 12/27/2013] [Revised: 05/13/2014] [Accepted: 05/15/2014] [Indexed: 12/03/2022]
Abstract
The success of genome-wide association studies (GWAS) in uncovering genetic risk factors for complex traits has generated great promise for the complete data generated by sequencing. The bumpy transition from GWAS to whole-exome or whole-genome association studies (WGAS) based on sequencing investigations has highlighted important differences in analysis and interpretation. We show how the loss in power due to the allele frequency spectrum targeted by sequencing is difficult to compensate for with realistic effect sizes and point to study designs that may help. We discuss several issues in interpreting the results, including a special case of the winner's curse. Extrapolation and prediction using rare SNPs is complex, because of the selective ascertainment of SNPs in case-control studies and the low amount of information at each SNP, and naive procedures are biased under the alternative. We also discuss the challenges in tuning gene-based tests and accounting for multiple testing when genes have very different sets of SNPs. The examples we emphasize in this paper highlight the difficult road we must travel for a two-letter switch.
Affiliation(s)
- Dan L Nicolae
  - Departments of Medicine, Statistics, and Human Genetics, University of Chicago, Chicago, IL 60637, USA
12
Byrnes AE, Wu MC, Wright FA, Li M, Li Y. The value of statistical or bioinformatics annotation for rare variant association with quantitative trait. Genet Epidemiol 2013; 37:666-74. [PMID: 23836599 PMCID: PMC4083762 DOI: 10.1002/gepi.21747] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 03/25/2013] [Revised: 05/20/2013] [Accepted: 06/03/2013] [Indexed: 11/06/2022]
Abstract
In the past few years, a plethora of methods for rare variant association with phenotype have been proposed. These methods aggregate information from multiple rare variants across genomic region(s), but there is little consensus as to which method is most effective. The weighting scheme adopted when aggregating information across variants is one of the primary determinants of effectiveness. Here we present a systematic evaluation of multiple weighting schemes through a series of simulations intended to mimic large sequencing studies of a quantitative trait. We evaluate existing phenotype-independent and phenotype-dependent methods, as well as weights estimated by penalized regression approaches including Lasso, Elastic Net, and SCAD. We find that the difference in power between phenotype-dependent schemes is negligible when high-quality functional annotations are available. When functional annotations are unavailable or incomplete, all methods suffer from power loss; however, the variable selection methods outperform the others at the cost of increased computational time. Therefore, in the absence of good annotation, we recommend variable selection methods (which can be viewed as "statistical annotation") on top of regions implicated by a phenotype-independent weighting scheme. Further, once a region is implicated, variable selection can help to identify potential causal single nucleotide polymorphisms for biological validation. These findings are supported by an analysis of a high coverage targeted sequencing study of 1,898 individuals.
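As an illustration of what "aggregating information across variants" with a phenotype-independent weighting scheme can look like, the following is a minimal sketch and not the procedure evaluated in the paper. It assumes genotypes are coded as minor-allele counts (0, 1, 2) and uses the inverse-standard-deviation weight 1/sqrt(p(1-p)), one common frequency-based choice; the function names are hypothetical.

```python
import math

def minor_allele_frequency(counts):
    """MAF of one variant from per-individual minor-allele counts (0/1/2)."""
    return sum(counts) / (2 * len(counts))

def frequency_weight(p):
    """Phenotype-independent weight 1/sqrt(p(1-p)); rarer variants get larger weights.
    Monomorphic variants (p == 0 or p == 1) carry no information and get weight 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return 1.0 / math.sqrt(p * (1.0 - p))

def burden_scores(genotypes):
    """Weighted burden score per individual.
    genotypes[i][j] is the minor-allele count of individual i at variant j."""
    n_variants = len(genotypes[0])
    weights = [
        frequency_weight(minor_allele_frequency([row[j] for row in genotypes]))
        for j in range(n_variants)
    ]
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]

if __name__ == "__main__":
    # Four individuals, two variants (toy data, assumed for illustration only).
    toy = [[0, 0], [1, 0], [0, 1], [1, 1]]
    print(burden_scores(toy))
```

The resulting per-individual score would then be regressed on the quantitative trait; a phenotype-dependent scheme would instead estimate the weights from the trait itself.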
Affiliation(s)
- Andrea E. Byrnes
  - Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599
- Michael C. Wu
  - Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599
- Fred A. Wright
  - Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599
- Mingyao Li
  - Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, Philadelphia, PA 19104
- Yun Li
  - Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina 27599
  - Department of Genetics, University of North Carolina, Chapel Hill, North Carolina 27599
  - Department of Computer Science, University of North Carolina, Chapel Hill, North Carolina 27599
13
14
Diabetic retinopathy risk prediction for fundus examination using sparse learning: a cross-sectional study. BMC Med Inform Decis Mak 2013; 13:106. [PMID: 24033926 PMCID: PMC3847617 DOI: 10.1186/1472-6947-13-106] [Citation(s) in RCA: 44] [Impact Index Per Article: 3.7] [Received: 06/15/2013] [Accepted: 09/02/2013] [Indexed: 12/20/2022]
Abstract
BACKGROUND Blindness due to diabetic retinopathy (DR) is the major disability in diabetic patients. Although early management has been shown to prevent vision loss, diabetic patients have a low rate of routine ophthalmologic examination. Hence, we developed and validated sparse learning models with the aim of identifying the risk of DR in diabetic patients. METHODS Health records from the Korea National Health and Nutrition Examination Surveys (KNHANES) V-1 were used. The prediction models for DR were constructed using data from 327 diabetic patients, and were validated internally on 163 patients in the KNHANES V-1. External validation was performed using 562 diabetic patients in the KNHANES V-2. The learning models, including ridge, elastic net, and LASSO, were compared to the traditional indicators of DR. RESULTS Considering the Bayesian information criterion, LASSO predicted DR most efficiently. In the internal and external validation, LASSO was significantly superior to the traditional indicators in terms of the area under the curve (AUC) of the receiver operating characteristic. LASSO showed an AUC of 0.81 and an accuracy of 73.6% in the internal validation, and an AUC of 0.82 and an accuracy of 75.2% in the external validation. CONCLUSION The sparse learning model using LASSO was effective in analyzing the underlying epidemiological patterns of DR. This is the first study to develop a machine learning model to predict DR risk using health records. LASSO can be an excellent choice when both discriminative power and variable selection are important in the analysis of high-dimensional electronic health records.
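For readers unfamiliar with the two metrics quoted in this abstract, the sketch below shows how an AUC and an accuracy can be computed from predicted risk scores. It is a generic illustration, not the study's code: the function names, the toy data, and the 0.5 decision threshold are all assumptions. The AUC is computed as the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

def accuracy(labels, scores, threshold=0.5):
    """Share of cases classified correctly when score >= threshold predicts 1."""
    correct = sum(1 for y, s in zip(labels, scores) if (s >= threshold) == (y == 1))
    return correct / len(labels)

if __name__ == "__main__":
    # Toy labels (1 = DR, 0 = no DR) and predicted risks, assumed for illustration.
    y = [1, 1, 0, 0]
    risk = [0.9, 0.4, 0.5, 0.1]
    print(roc_auc(y, risk), accuracy(y, risk))
```

Unlike accuracy, the AUC does not depend on a chosen threshold, which is one reason both figures are usually reported together.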