1
Chen W, Coombes BJ, Larson NB. Recent advances and challenges of rare variant association analysis in the biobank sequencing era. Front Genet 2022; 13:1014947. [PMID: 36276986 PMCID: PMC9582646 DOI: 10.3389/fgene.2022.1014947] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 08/09/2022] [Accepted: 09/22/2022] [Indexed: 12/04/2022] Open
Abstract
Causal variants for rare genetic diseases are often rare in the general population. Rare variants may also contribute to common complex traits and can have much larger per-allele effect sizes than common variants, although power to detect these associations can be limited. Sequencing costs have steadily declined with technological advancements, making it feasible to adopt whole-exome and whole-genome profiling for large biobank-scale sample sizes. These large amounts of sequencing data provide both opportunities and challenges for rare-variant association analysis. Herein, we review the basic concepts of rare-variant analysis methods, the current state-of-the-art methods in utilizing variant annotations or external controls to improve statistical power, and particular challenges facing rare-variant analysis, such as accounting for population structure and extremely unbalanced case-control designs. We also review recent advances and challenges in rare-variant analysis for familial sequencing data and for more complex phenotypes such as survival data. Finally, we discuss other potential directions for further methodology investigation.
Affiliation(s)
- Wenan Chen
- Center for Applied Bioinformatics, St. Jude Children’s Research Hospital, Memphis, TN, United States
- *Correspondence: Wenan Chen; Brandon J. Coombes; Nicholas B. Larson
- Brandon J. Coombes
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States
- Nicholas B. Larson
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States
2
The association between FTO polymorphisms and type 2 diabetes in Asian populations: A meta-analysis. Meta Gene 2021. [DOI: 10.1016/j.mgene.2021.100958] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 01/23/2023] Open
3
Sun TH, Shao YHJ, Mao CL, Hung MN, Lo YY, Ko TM, Hsiao TH. A Novel Quality-Control Procedure to Improve the Accuracy of Rare Variant Calling in SNP Arrays. Front Genet 2021; 12:736390. [PMID: 34764980 PMCID: PMC8577504 DOI: 10.3389/fgene.2021.736390] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/05/2021] [Accepted: 09/21/2021] [Indexed: 12/16/2022] Open
Abstract
Background: Single-nucleotide polymorphism (SNP) arrays are an ideal technology for genotyping genetic variants in mass screening. However, using SNP arrays to detect rare variants [with a minor allele frequency (MAF) of <1%] is still a challenge because of noise signals and batch effects. An approach that improves the genotyping quality is needed for clinical applications. Methods: We developed a quality-control procedure for rare variants which integrates different algorithms, filters, and experiments to increase the accuracy of variant calling. Using data from the TWB 2.0 custom Axiom array, we adopted an advanced normalization adjustment to prevent false calls caused by splitting the cluster and a rare het adjustment which decreases false calls in rare variants. The concordance of allelic frequencies from array data was compared to those from sequencing datasets of Taiwanese. Finally, genotyping results were used to detect familial hypercholesterolemia (FH), thrombophilia (TH), and maturity-onset diabetes of the young (MODY) to assess the performance in disease screening. All heterozygous calls were verified by Sanger sequencing or qPCR. The positive predictive value (PPV) of each step was estimated to evaluate the performance of our procedure. Results: We analyzed SNP array data from 43,433 individuals, which interrogated 267,247 rare variants. The advanced normalization and rare het adjustment methods adjusted genotyping calling of 168,134 variants (96.49%). We further removed 3916 probesets which were discordant in MAFs between the SNP array and sequencing data. The PPV for detecting pathogenic variants with 0.01%10,000 are available. The results demonstrated our procedure could perform correct genotype calling of rare variants. It provides a solution of pathogenic variant detection through SNP array. The approach brings tremendous promise for implementing precision medicine in medical practice.
Affiliation(s)
- Ting-Hsuan Sun
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Department of Medical Research, Taichung Veterans General Hospital, Taichung, Taiwan
- Yu-Hsuan Joni Shao
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
- Clinical Big Data Research Center, Taipei Medical University Hospital, Taipei, Taiwan
- Chien-Lin Mao
- Department of Medical Research, Taichung Veterans General Hospital, Taichung, Taiwan
- Miao-Neng Hung
- Department of Medical Research, Taichung Veterans General Hospital, Taichung, Taiwan
- Yi-Yun Lo
- Department of Medical Research, Taichung Veterans General Hospital, Taichung, Taiwan
- Tai-Ming Ko
- Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan
- Tzu-Hung Hsiao
- Department of Medical Research, Taichung Veterans General Hospital, Taichung, Taiwan
- Department of Public Health, Fu Jen Catholic University, New Taipei City, Taiwan
- Institute of Genomics and Bioinformatics, National Chung Hsing University, Taichung, Taiwan
- Research Center for Biomedical Science and Engineering, National Tsing Hua University, Hsinchu, Taiwan
4
McEwan AR, MacKenzie A. Perspective: Quality Versus Quantity; Is It Important to Assess the Role of Enhancers in Complex Disease from an In Vivo Perspective? Int J Mol Sci 2020; 21:E7856. [PMID: 33113946 PMCID: PMC7660172 DOI: 10.3390/ijms21217856] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/25/2020] [Revised: 10/15/2020] [Accepted: 10/20/2020] [Indexed: 12/18/2022] Open
Abstract
Sequencing of the human genome has permitted the development of genome-wide association studies (GWAS) to analyze the genetics of a number of complex disorders such as depression, anxiety and substance abuse. Thanks to their ability to analyze huge cohort sizes, these studies have successfully identified thousands of loci associated with a broad spectrum of complex diseases. Disconcertingly, the majority of these GWAS hits occur in non-coding regions of the genome, much of which controls the cell-type-specific expression of genes essential to health. In contrast to gene coding sequences, it is a challenge to understand the function of this non-coding regulatory genome using conventional biochemical techniques in cell lines. The current commentary scrutinizes the field of complex genetics from the standpoint of the large-scale whole-genome functional analysis of the promoters and cis-regulatory elements using chromatin markers. We contrast these large scale quantitative techniques against comparative genomics and in vivo analyses including CRISPR/CAS9 genome editing to determine the functional characteristics of these elements and to understand how polymorphic variation and epigenetic changes within these elements might contribute to complex disease and drug response. Most importantly, we suggest that, although the role of chromatin markers will continue to be important in identifying and characterizing enhancers, more emphasis must be placed on their analysis in relevant in-vivo models that take account of the appropriate cell-type-specific roles of these elements. It is hoped that offering these insights might refocus progress in analyzing the data tsunami of non-coding GWAS and whole-genome sequencing "hits" that threatens to overwhelm progress in the field.
Affiliation(s)
- Alasdair MacKenzie
- School of Medicine, Medical Sciences and Nutrition, Institute of Medical Sciences, Foresterhill, University of Aberdeen, Aberdeen AB25 2ZD, UK
5
Abstract
PURPOSE OF REVIEW The goal of this review is to summarize the state of big data analyses in the study of heart failure (HF). We discuss the use of big data in the HF space, focusing on "omics" and clinical data. We address some limitations of this data, as well as their future potential. RECENT FINDINGS Omics are providing insight into plasmal and myocardial molecular profiles in HF patients. The introduction of single cell and spatial technologies is a major advance that will reshape our understanding of cell heterogeneity and function as well as tissue architecture. Clinical data analysis focuses on HF phenotyping and prognostic modeling. Big data approaches are increasingly common in HF research. The use of methods designed for big data, such as machine learning, may help elucidate the biology underlying HF. However, important challenges remain in the translation of this knowledge into improvements in clinical care.
Affiliation(s)
- Jan D Lanzer
- Institute for Computational Biomedicine, Bioquant, Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Heidelberg, Germany
- Faculty of Biosciences, Heidelberg University, Heidelberg, Germany
- Internal Medicine II, Heidelberg University Hospital, Heidelberg, Germany
- Florian Leuschner
- Department of Cardiology, Medical University Hospital, Heidelberg, Germany
- DZHK (German Centre for Cardiovascular Research), Heidelberg, Germany
- Rafael Kramann
- Department of Nephrology and Clinical Immunology, RWTH Aachen University, Aachen, Germany
- Department of Internal Medicine, Nephrology and Transplantation, Erasmus Medical Center, Rotterdam, The Netherlands
- Rebecca T Levinson
- Institute for Computational Biomedicine, Bioquant, Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Heidelberg, Germany
- Internal Medicine II, Heidelberg University Hospital, Heidelberg, Germany
- Julio Saez-Rodriguez
- Institute for Computational Biomedicine, Bioquant, Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Heidelberg, Germany
- Joint Research Centre for Computational Biomedicine (JRC-COMBINE), Faculty of Medicine, RWTH Aachen University, Aachen, Germany
6
Vuckovic D, Mezzavilla M, Cocca M, Morgan A, Brumat M, Catamo E, Concas MP, Biino G, Franzè A, Ambrosetti U, Pirastu M, Gasparini P, Girotto G. Whole-genome sequencing reveals new insights into age-related hearing loss: cumulative effects, pleiotropy and the role of selection. Eur J Hum Genet 2018; 26:1167-1179. [PMID: 29725052 PMCID: PMC6057993 DOI: 10.1038/s41431-018-0126-2] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.7] [Received: 07/12/2017] [Revised: 02/05/2018] [Accepted: 02/13/2018] [Indexed: 01/17/2023] Open
Abstract
Age-related hearing loss (ARHL) is the most common sensory disorder in the elderly. Although not directly life threatening, it contributes to loss of autonomy and is associated with anxiety, depression and cognitive decline. To search for genetic risk factors underlying ARHL, a large whole-genome sequencing (WGS) approach was carried out in a cohort of 212 cases and controls, all older than 50 years, to select genes characterized by a burden of variants specific to cases or controls. Accordingly, the total variation load per gene was compared and two groups were detected: 375 genes more variable in cases and 371 more variable in controls. In both groups, Gene Ontology analysis showed that the largest enrichment for biological processes (fold > 5, p-value = 0.042) was the “sensory perception of sound”, suggesting cumulative genetic effects were involved. Replication confirmed 141 genes, while additional analysis based on natural selection led to a prioritization of 21 genes. The majority of them (20 out of 21) showed positive expression in mouse cochlea cDNA and were associated with two functional pathways. Among them, two genes were previously associated with hearing (CSMD1 and PTPRD) and re-sequenced in a large Italian cohort of ARHL patients (N = 389). Results led to the identification of six coding variants not detected in cases so far, suggesting a possible protective role, which requires investigation. In conclusion, we show that this multistep strategy (WGS, selection, expression, pathway analysis and targeted re-sequencing) can provide major insights into the molecular characterization of complex diseases such as ARHL.
Affiliation(s)
- Dragana Vuckovic
- Medical Sciences, Chirurgical and Health Department, University of Trieste, Trieste, Italy
- Medical Genetics, Institute for Maternal and Child Health - IRCCS "Burlo Garofolo", Trieste, Italy
- Massimo Mezzavilla
- Medical Genetics, Institute for Maternal and Child Health - IRCCS "Burlo Garofolo", Trieste, Italy
- Massimiliano Cocca
- Medical Genetics, Institute for Maternal and Child Health - IRCCS "Burlo Garofolo", Trieste, Italy
- Anna Morgan
- Medical Sciences, Chirurgical and Health Department, University of Trieste, Trieste, Italy
- Medical Genetics, Institute for Maternal and Child Health - IRCCS "Burlo Garofolo", Trieste, Italy
- Marco Brumat
- Medical Sciences, Chirurgical and Health Department, University of Trieste, Trieste, Italy
- Eulalia Catamo
- Medical Genetics, Institute for Maternal and Child Health - IRCCS "Burlo Garofolo", Trieste, Italy
- Maria Pina Concas
- Medical Genetics, Institute for Maternal and Child Health - IRCCS "Burlo Garofolo", Trieste, Italy
- Ginevra Biino
- Institute of Molecular Genetics, National Research Council of Italy, Pavia, Italy
- Annamaria Franzè
- Ceinge Advanced Biotechnology, Naples, Italy
- Neuroscience, Reproductive and Odontology Sciences Department, University of Naples "Federico II", Naples, Italy
- Umberto Ambrosetti
- UO Audiology, Fondazione IRCCS Ca Granda, Ospedale Maggiore Policlinico, Mangiagalli e Regina Elena, Milan, Italy
- Audiology Unit, Department of Clinical Sciences and Community Health, University of Milan, Milan, Italy
- Mario Pirastu
- Institute of Population Genetics, National Research Council of Italy, Sassari, Italy
- Paolo Gasparini
- Medical Sciences, Chirurgical and Health Department, University of Trieste, Trieste, Italy
- Medical Genetics, Institute for Maternal and Child Health - IRCCS "Burlo Garofolo", Trieste, Italy
- Giorgia Girotto
- Medical Sciences, Chirurgical and Health Department, University of Trieste, Trieste, Italy
- Medical Genetics, Institute for Maternal and Child Health - IRCCS "Burlo Garofolo", Trieste, Italy
7
Panasiewicz G, Bieniek-Kobuszewska M, Lipka A, Majewska M, Jedryczko R, Szafranska B. Novel effects of identified SNPs within the porcine Pregnancy-Associated Glycoprotein gene family (pPAGs) on the major reproductive traits in Hirschmann hybrid-line sows. Res Vet Sci 2017; 114:123-130. [PMID: 28371694 DOI: 10.1016/j.rvsc.2017.03.015] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 10/25/2016] [Revised: 03/18/2017] [Accepted: 03/27/2017] [Indexed: 02/07/2023]
Abstract
This is the first study describing identification of SNPs within the multiple and polymorphic Pregnancy-Associated Glycoprotein gene family (PAGs) in the genome of the domestic pig (pPAGs). We identified pPAG-like (pPAG-L) genotypes in primiparous and multiparous farmed hybrid-line JSR Hirschmann (Hrn) sows (N=159), in which various novel associations with their phenotypes for the major reproductive traits have been discovered. Genomic DNA templates were isolated from the blood and different pPAG-L primers were used to amplify various regions by PCR. Electrophoretically-separated amplicons were selected, purified and sequenced. All identified SNPs were verified for possible pPAG2-L genotype associations with the major reproductive traits. In total, 196 SNPs were identified within the entire structure of the pPAG2-Ls, encompassing 9 exons and 8 (A-H) introns, resembling all aspartic proteinases. It was discovered that among all SNPs, one diplotype localized in exon 6 (657C>T/749G>C; pPAG2 ORF cDNA numbering; L34361) caused amino acid substitutions (Asp220→Asn and Ser250→Thr) in the polypeptide precursors and was associated with an increase in the number of live-born piglets (P≤0.05) in Hrn sows. In turn, co-localized SNP (504g>a; KF537535 numbering) in the intron F of the pPAG2-Ls, but only in the homozygotic genotype (gg), was associated with an increased number of live-born (P≤0.01) and weaned (P≤0.05) piglets in the Hrn sows. These results qualify the pPAG2-Ls as candidate genes of the main QTLs. The novel pPAG SNP profiles provide the basis for a diagnostic genotyping test required for early pre-selection of female/male piglets, presumably mainly useful in various breeding herds.
Affiliation(s)
- Grzegorz Panasiewicz
- Department of Animal Physiology, Faculty of Biology and Biotechnology, University of Warmia and Mazury in Olsztyn, ul. Oczapowskiego 1A, 10-719 Olsztyn-Kortowo, Poland
- Martyna Bieniek-Kobuszewska
- Department of Animal Physiology, Faculty of Biology and Biotechnology, University of Warmia and Mazury in Olsztyn, ul. Oczapowskiego 1A, 10-719 Olsztyn-Kortowo, Poland
- Department of Dermatology, Sexually Transmitted Diseases and Clinical Immunology, Faculty of Medical Sciences, University of Warmia and Mazury in Olsztyn, ul. Wojska Polskiego 30, 10-229 Olsztyn, Poland
- Aleksandra Lipka
- Department of Animal Physiology, Faculty of Biology and Biotechnology, University of Warmia and Mazury in Olsztyn, ul. Oczapowskiego 1A, 10-719 Olsztyn-Kortowo, Poland
- Marta Majewska
- Department of Human Physiology, Faculty of Medical Sciences, University of Warmia and Mazury in Olsztyn, ul. Warszawska 30, 10-082 Olsztyn, Poland
- Bozena Szafranska
- Department of Animal Physiology, Faculty of Biology and Biotechnology, University of Warmia and Mazury in Olsztyn, ul. Oczapowskiego 1A, 10-719 Olsztyn-Kortowo, Poland
8
Abstract
Over the past few years, interest in the identification of rare variants that influence human phenotype has led to the development of many statistical methods for testing for association between sets of rare variants and binary or quantitative traits. Here, I review some of the most important ideas that underlie these methods and the most relevant issues when choosing a method for analysis. In addition to the tests for association, I review crucial issues in performing a rare variant study, from experimental design to interpretation and validation. I also discuss the many challenges of these studies, some of their limitations, and future research directions.
Affiliation(s)
- Dan L Nicolae
- Departments of Medicine and Statistics, University of Chicago, Chicago, Illinois 60637
9
A stop-codon of the phosphodiesterase 11A gene is associated with elevated blood pressure and measures of obesity. J Hypertens 2016; 34:445-51; discussion 451. [DOI: 10.1097/hjh.0000000000000821] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 01/11/2023]
10
Kardos M, Husby A, McFarlane SE, Qvarnström A, Ellegren H. Whole-genome resequencing of extreme phenotypes in collared flycatchers highlights the difficulty of detecting quantitative trait loci in natural populations. Mol Ecol Resour 2015; 16:727-41. [PMID: 26649993 DOI: 10.1111/1755-0998.12498] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.2] [Received: 10/27/2015] [Revised: 11/18/2015] [Accepted: 11/30/2015] [Indexed: 12/24/2022]
Abstract
Dissecting the genetic basis of phenotypic variation in natural populations is a long-standing goal in evolutionary biology. One open question is whether quantitative traits are determined only by large numbers of genes with small effects, or whether variation also exists in large-effect loci. We conducted genomewide association analyses of forehead patch size (a sexually selected trait) on 81 whole-genome-resequenced male collared flycatchers with extreme phenotypes, and on 415 males sampled independent of patch size and genotyped with a 50K SNP chip. No SNPs were genomewide statistically significantly associated with patch size. Simulation-based power analyses suggest that the power to detect large-effect loci responsible for 10% of phenotypic variance was <0.5 in the genome resequencing analysis, and <0.1 in the SNP chip analysis. Reducing the recombination by two-thirds relative to collared flycatchers modestly increased power. Tripling sample size increased power to >0.8 for resequencing of extreme phenotypes (N = 243), but power remained <0.2 for the 50K SNP chip analysis (N = 1245). At least 1 million SNPs were necessary to achieve power >0.8 when analysing 415 randomly sampled phenotypes. However, power of the 50K SNP chip to detect large-effect loci was nearly 0.8 in simulations with a small effective population size of 1500. These results suggest that reliably detecting large-effect trait loci in large natural populations will often require thousands of individuals and near complete sampling of the genome. Encouragingly, far fewer individuals and loci will often be sufficient to reliably detect large-effect loci in small populations with widespread strong linkage disequilibrium.
Affiliation(s)
- Marty Kardos
- Department of Evolutionary Biology, Evolutionary Biology Centre (EBC), Uppsala University, Norbyvägen 18D, Uppsala, 75236, Sweden
- Arild Husby
- Department of Biosciences, University of Helsinki, PO Box 65, Helsinki, 00014, Finland
- Centre for Biodiversity Dynamics, Department of Biology, Norwegian University of Science and Technology, Trondheim, 7491, Norway
- S Eryn McFarlane
- Department of Animal Ecology, Evolutionary Biology Centre (EBC), Uppsala University, Norbyvägen 18D, Uppsala, 75236, Sweden
- Anna Qvarnström
- Department of Animal Ecology, Evolutionary Biology Centre (EBC), Uppsala University, Norbyvägen 18D, Uppsala, 75236, Sweden
- Hans Ellegren
- Department of Evolutionary Biology, Evolutionary Biology Centre (EBC), Uppsala University, Norbyvägen 18D, Uppsala, 75236, Sweden
11
Upton A, Trelles O, Cornejo-García JA, Perkins JR. Review: High-performance computing to detect epistasis in genome scale data sets. Brief Bioinform 2015; 17:368-79. [PMID: 26272945 DOI: 10.1093/bib/bbv058] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.5] [Received: 03/31/2015] [Indexed: 11/14/2022] Open
Abstract
It is becoming clear that most human diseases have a complex etiology that cannot be explained by single nucleotide polymorphisms (SNPs) or simple additive combinations; the general consensus is that they are caused by combinations of multiple genetic variations. The limited success of some genome-wide association studies is partly a result of this focus on single genetic markers. A more promising approach is to take into account epistasis, by considering the association of multiple SNP interactions with disease. However, as genomic data continues to grow in resolution, and genome and exome sequencing become more established, the number of combinations of variants to consider increases rapidly. Two potential solutions should be considered: the use of high-performance computing, which allows us to consider a larger number of variables, and heuristics to make the solution more tractable, essential in the case of genome sequencing. In this review, we look at different computational methods to analyse epistatic interactions within disease-related genetic data sets created by microarray technology. We also review efforts to use epistatic analysis results to produce biomarkers for diagnostic tests and give our views on future directions in this field in light of advances in sequencing technology and variants in non-coding regions.
12
Livne OE, Han L, Alkorta-Aranburu G, Wentworth-Sheilds W, Abney M, Ober C, Nicolae DL. PRIMAL: Fast and accurate pedigree-based imputation from sequence data in a founder population. PLoS Comput Biol 2015; 11:e1004139. [PMID: 25735005 PMCID: PMC4348507 DOI: 10.1371/journal.pcbi.1004139] [Citation(s) in RCA: 34] [Impact Index Per Article: 3.4] [Received: 05/29/2014] [Accepted: 01/19/2015] [Indexed: 12/31/2022] Open
Abstract
Founder populations and large pedigrees offer many well-known advantages for genetic mapping studies, including cost-efficient study designs. Here, we describe PRIMAL (PedigRee IMputation ALgorithm), a fast and accurate pedigree-based phasing and imputation algorithm for founder populations. PRIMAL incorporates both existing and original ideas, such as a novel indexing strategy of Identity-By-Descent (IBD) segments based on clique graphs. We were able to impute the genomes of 1,317 South Dakota Hutterites, who had genome-wide genotypes for ~300,000 common single nucleotide variants (SNVs), from 98 whole genome sequences. Using a combination of pedigree-based and LD-based imputation, we were able to assign 87% of genotypes with >99% accuracy over the full range of allele frequencies. Using the IBD cliques we were also able to infer the parental origin of 83% of alleles, and genotypes of deceased recent ancestors for whom no genotype information was available. This imputed data set will enable us to better study the relative contribution of rare and common variants on human phenotypes, as well as parental origin effect of disease risk alleles in >1,000 individuals at minimal cost.
Affiliation(s)
- Oren E. Livne
- Department of Human Genetics, The University of Chicago, Chicago, Illinois, United States of America
- Lide Han
- Department of Human Genetics, The University of Chicago, Chicago, Illinois, United States of America
- Gorka Alkorta-Aranburu
- Department of Human Genetics, The University of Chicago, Chicago, Illinois, United States of America
- William Wentworth-Sheilds
- Department of Human Genetics, The University of Chicago, Chicago, Illinois, United States of America
- Mark Abney
- Department of Human Genetics, The University of Chicago, Chicago, Illinois, United States of America
- Carole Ober
- Department of Human Genetics, The University of Chicago, Chicago, Illinois, United States of America
- Dan L. Nicolae
- Department of Human Genetics, The University of Chicago, Chicago, Illinois, United States of America
- Departments of Medicine, and Statistics, The University of Chicago, Chicago, Illinois, United States of America