1
|
Fast two-stage phasing of large-scale sequence data. Am J Hum Genet 2021; 108:1880-1890. [PMID: 34478634 DOI: 10.1016/j.ajhg.2021.08.005] [Citation(s) in RCA: 302] [Impact Index Per Article: 75.5] [Received: 05/21/2021] [Accepted: 08/10/2021] [Indexed: 01/02/2023]
Abstract
Haplotype phasing is the estimation of haplotypes from genotype data. We present a fast, accurate, and memory-efficient haplotype phasing method that scales to large-scale SNP array and sequence data. The method uses marker windowing and composite reference haplotypes to reduce memory usage and computation time. It incorporates a progressive phasing algorithm that identifies confidently phased heterozygotes in each iteration and fixes the phase of these heterozygotes in subsequent iterations. For data with many low-frequency variants, such as whole-genome sequence data, the method employs a two-stage phasing algorithm that phases high-frequency markers via progressive phasing in the first stage and phases low-frequency markers via genotype imputation in the second stage. This haplotype phasing method is implemented in the open-source Beagle 5.2 software package. We compare Beagle 5.2 and SHAPEIT 4.2.1 by using expanding subsets of 485,301 UK Biobank samples and 38,387 TOPMed samples. Both methods have very similar accuracy and computation time for UK Biobank SNP array data. However, for TOPMed sequence data, Beagle is more than 20 times faster than SHAPEIT, achieves similar accuracy, and scales to larger sample sizes.
|
Research Support, N.I.H., Extramural |
4 |
302 |
2
|
Chen H, Huffman JE, Brody JA, Wang C, Lee S, Li Z, Gogarten SM, Sofer T, Bielak LF, Bis JC, Blangero J, Bowler RP, Cade BE, Cho MH, Correa A, Curran JE, de Vries PS, Glahn DC, Guo X, Johnson AD, Kardia S, Kooperberg C, Lewis JP, Liu X, Mathias RA, Mitchell BD, O’Connell JR, Peyser PA, Post WS, Reiner AP, Rich SS, Rotter JI, Silverman EK, Smith JA, Vasan RS, Wilson JG, Yanek LR, Redline S, Smith NL, Boerwinkle E, Borecki IB, Cupples LA, Laurie CC, Morrison AC, Rice KM, Lin X. Efficient Variant Set Mixed Model Association Tests for Continuous and Binary Traits in Large-Scale Whole-Genome Sequencing Studies. Am J Hum Genet 2019; 104:260-274. [PMID: 30639324 DOI: 10.1016/j.ajhg.2018.12.012] [Citation(s) in RCA: 82] [Impact Index Per Article: 13.7] [Received: 08/18/2018] [Accepted: 12/17/2018] [Indexed: 12/12/2022]
Abstract
With advances in whole-genome sequencing (WGS) technology, more advanced statistical methods for testing genetic association with rare variants are being developed. Methods in which variants are grouped for analysis are also known as variant-set, gene-based, and aggregate unit tests. The burden test and sequence kernel association test (SKAT) are two widely used variant-set tests, which were originally developed for samples of unrelated individuals and later have been extended to family data with known pedigree structures. However, computationally efficient and powerful variant-set tests are needed to make analyses tractable in large-scale WGS studies with complex study samples. In this paper, we propose the variant-set mixed model association tests (SMMAT) for continuous and binary traits using the generalized linear mixed model framework. These tests can be applied to large-scale WGS studies involving samples with population structure and relatedness, such as in the National Heart, Lung, and Blood Institute's Trans-Omics for Precision Medicine (TOPMed) program. SMMATs share the same null model for different variant sets, and a virtue of this null model, which includes covariates only, is that it needs to be fit only once for all tests in each genome-wide analysis. Simulation studies show that all the proposed SMMATs correctly control type I error rates for both continuous and binary traits in the presence of population structure and relatedness. We also illustrate our tests in a real data example of analysis of plasma fibrinogen levels in the TOPMed program (n = 23,763), using the Analysis Commons, a cloud-based computing platform.
|
Research Support, N.I.H., Extramural |
6 |
82 |
3
|
Peljto AL, Blumhagen RZ, Walts AD, Cardwell J, Powers J, Corte TJ, Dickinson JL, Glaspole I, Moodley YP, Vasakova MK, Bendstrup E, Davidsen JR, Borie R, Crestani B, Dieude P, Bonella F, Costabel U, Gudmundsson G, Donnelly SC, Egan J, Henry MT, Keane MP, Kennedy MP, McCarthy C, McElroy AN, Olaniyi JA, O’Reilly KMA, Richeldi L, Leone PM, Poletti V, Puppo F, Tomassetti S, Luzzi V, Kokturk N, Mogulkoc N, Fiddler CA, Hirani N, Jenkins RG, Maher TM, Molyneaux PL, Parfrey H, Braybrooke R, Blackwell TS, Jackson PD, Nathan SD, Porteous MK, Brown KK, Christie JD, Collard HR, Eickelberg O, Foster EE, Gibson KF, Glassberg M, Kass DJ, Kropski JA, Lederer D, Linderholm AL, Loyd J, Mathai SK, Montesi SB, Noth I, Oldham JM, Palmisciano AJ, Reichner CA, Rojas M, Roman J, Schluger N, Shea BS, Swigris JJ, Wolters PJ, Zhang Y, Prele CMA, Enghelmayer JI, Otaola M, Ryerson CJ, Salinas M, Sterclova M, Gebremariam TH, Myllärniemi M, Carbone RG, Furusawa H, Hirose M, Inoue Y, Miyazaki Y, Ohta K, Ohta S, Okamoto T, Kim DS, Pardo A, Selman M, Aranda AU, Park MS, Park JS, Song JW, Molina-Molina M, Planas-Cerezales L, Westergren-Thorsson G, Smith AV, Manichaikul AW, Kim JS, Rich SS, Oelsner EC, Barr RG, Rotter JI, Dupuis J, O’Connor G, Vasan RS, Cho MH, Silverman EK, Schwarz MI, Steele MP, Lee JS, Yang IV, Fingerlin TE, Schwartz DA. Idiopathic Pulmonary Fibrosis Is Associated with Common Genetic Variants and Limited Rare Variants. Am J Respir Crit Care Med 2023; 207:1194-1202. [PMID: 36602845 PMCID: PMC10161752 DOI: 10.1164/rccm.202207-1331oc] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 07/14/2022] [Accepted: 01/04/2023] [Indexed: 01/06/2023]
Abstract
Rationale: Idiopathic pulmonary fibrosis (IPF) is a rare, irreversible, and progressive disease of the lungs. Common genetic variants, in addition to nongenetic factors, have been consistently associated with IPF. Rare variants identified by candidate gene, family-based, and exome studies have also been reported to associate with IPF. However, the extent to which rare variants, genome-wide, may contribute to the risk of IPF remains unknown. Objectives: We used whole-genome sequencing to investigate the role of rare variants, genome-wide, on IPF risk. Methods: As part of the Trans-Omics for Precision Medicine Program, we sequenced 2,180 cases of IPF. Association testing focused on the aggregated effect of rare variants (minor allele frequency ⩽0.01) within genes or regions. We also identified individual rare variants that are influential within genes and estimated the heritability of IPF on the basis of rare and common variants. Measurements and Main Results: Rare variants in both TERT and RTEL1 were significantly associated with IPF. A single rare variant in each of the TERT and RTEL1 genes was found to consistently influence the aggregated test statistics. There was no significant evidence of association with other previously reported rare variants. The SNP heritability of IPF was estimated to be 32% (SE = 3%). Conclusions: Rare variants within the TERT and RTEL1 genes and well-established common variants have the largest contribution to IPF risk overall. Efforts in risk profiling or the development of therapies for IPF that focus on TERT, RTEL1, common variants, and environmental risk factors are likely to have the largest impact on this complex disease.
|
Research Support, N.I.H., Extramural |
2 |
27 |
4
|
Jun G, English AC, Metcalf GA, Yang J, Chaisson MJP, Pankratz N, Menon VK, Salerno WJ, Krasheninina O, Smith AV, Lane JA, Blackwell T, Kang HM, Salvi S, Meng Q, Shen H, Pasham D, Bhamidipati S, Kottapalli K, Arnett DK, Ashley-Koch A, Auer PL, Beutel KM, Bis JC, Blangero J, Bowden DW, Brody JA, Cade BE, Chen YDI, Cho MH, Curran JE, Fornage M, Freedman BI, Fingerlin T, Gelb BD, Hou L, Hung YJ, Kane JP, Kaplan R, Kim W, Loos RJ, Marcus GM, Mathias RA, McGarvey ST, Montgomery C, Naseri T, Nouraie SM, Preuss MH, Palmer ND, Peyser PA, Raffield LM, Ratan A, Redline S, Reupena S, Rotter JI, Rich SS, Rienstra M, Ruczinski I, Sankaran VG, Schwartz DA, Seidman CE, Seidman JG, Silverman EK, Smith JA, Stilp A, Taylor KD, Telen MJ, Weiss ST, Williams LK, Wu B, Yanek LR, Zhang Y, Lasky-Su J, Gingras MC, Dutcher SK, Eichler EE, Gabriel S, Germer S, Kim R, Viaud-Martinez KA, Nickerson DA, Luo J, Reiner A, Gibbs RA, Boerwinkle E, Abecasis G, Sedlazeck FJ. Structural variation across 138,134 samples in the TOPMed consortium. bioRxiv 2023:2023.01.25.525428. [PMID: 36747810 PMCID: PMC9900832 DOI: 10.1101/2023.01.25.525428] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 01/28/2023]
Abstract
Ever larger Structural Variant (SV) catalogs highlighting the diversity within and between populations help researchers better understand the links between SVs and disease. The identification of SVs from DNA sequence data is non-trivial and requires a balance between comprehensiveness and precision. Here we present a catalog of 355,667 SVs (59.34% novel) across autosomes and the X chromosome (50bp+) from 138,134 individuals in the diverse TOPMed consortium. We describe our methodologies for SV inference resulting in high variant quality and >90% allele concordance compared to long-read de-novo assemblies of well-characterized control samples. We demonstrate utility through significant associations between SVs and various important cardio-metabolic and hematologic traits. We have identified 690 SV hotspots and deserts and those that potentially impact the regulation of medically relevant genes. This catalog characterizes SVs across multiple populations and will serve as a valuable tool to understand the impact of SV on disease development and progression.
|
Preprint |
2 |
8 |
5
|
Sengupta D, Botha G, Meintjes A, Mbiyavanga M, Hazelhurst S, Mulder N, Ramsay M, Choudhury A. Performance and accuracy evaluation of reference panels for genotype imputation in sub-Saharan African populations. Cell Genomics 2023; 3:100332. [PMID: 37388906 PMCID: PMC10300601 DOI: 10.1016/j.xgen.2023.100332] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/21/2022] [Revised: 02/11/2023] [Accepted: 05/02/2023] [Indexed: 07/01/2023]
Abstract
Based on evaluations of imputation performed on a genotype dataset consisting of about 11,000 sub-Saharan African (SSA) participants, we show Trans-Omics for Precision Medicine (TOPMed) and the African Genome Resource (AGR) to be currently the best panels for imputing SSA datasets. We report notable differences in the number of single-nucleotide polymorphisms (SNPs) that are imputed by different panels in datasets from East, West, and South Africa. Comparisons with a subset of 95 SSA high-coverage whole-genome sequences (WGSs) show that despite being about 20-fold smaller, the AGR imputed dataset has higher concordance with the WGSs. Moreover, the level of concordance between imputed and WGS datasets was strongly influenced by the extent of Khoe-San ancestry in a genome, highlighting the need for integration of not only geographically but also ancestrally diverse WGS data in reference panels for further improvement in imputation of SSA datasets. Approaches that integrate imputed data from different panels could also lead to better imputation.
|
research-article |
2 |
4 |
6
|
Jun G, English AC, Metcalf GA, Yang J, Chaisson MJP, Pankratz N, Menon VK, Salerno WJ, Krasheninina O, Smith AV, Lane JA, Blackwell T, Kang HM, Salvi S, Meng Q, Shen H, Pasham D, Bhamidipati S, Kottapalli K, Arnett DK, Ashley-Koch A, Auer PL, Beutel KM, Bis JC, Blangero J, Bowden DW, Brody JA, Cade BE, Chen YDI, Cho MH, Curran JE, Fornage M, Freedman BI, Fingerlin T, Gelb BD, Hou L, Hung YJ, Kane JP, Kaplan R, Kim W, Loos RJ, Marcus GM, Mathias RA, McGarvey ST, Montgomery C, Naseri T, Nouraie SM, Preuss MH, Palmer ND, Peyser PA, Raffield LM, Ratan A, Redline S, Reupena S, Rotter JI, Rich SS, Rienstra M, Ruczinski I, Sankaran VG, Schwartz DA, Seidman CE, Seidman JG, Silverman EK, Smith JA, Stilp A, Taylor KD, Telen MJ, Weiss ST, Williams LK, Wu B, Yanek LR, Zhang Y, Lasky-Su J, Gingras MC, Dutcher SK, Eichler EE, Gabriel S, Germer S, Kim R, Viaud-Martinez KA, Nickerson DA, Luo J, Reiner A, Gibbs RA, Boerwinkle E, Abecasis G, Sedlazeck FJ. Structural variation across 138,134 samples in the TOPMed consortium. Research Square 2023:rs.3.rs-2515453. [PMID: 36778386 PMCID: PMC9915771 DOI: 10.21203/rs.3.rs-2515453/v1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 02/05/2023]
Abstract
Ever larger Structural Variant (SV) catalogs highlighting the diversity within and between populations help researchers better understand the links between SVs and disease. The identification of SVs from DNA sequence data is non-trivial and requires a balance between comprehensiveness and precision. Here we present a catalog of 355,667 SVs (59.34% novel) across autosomes and the X chromosome (50bp+) from 138,134 individuals in the diverse TOPMed consortium. We describe our methodologies for SV inference resulting in high variant quality and >90% allele concordance compared to long-read de-novo assemblies of well-characterized control samples. We demonstrate utility through significant associations between SVs and various important cardio-metabolic and hematologic traits. We have identified 690 SV hotspots and deserts and those that potentially impact the regulation of medically relevant genes. This catalog characterizes SVs across multiple populations and will serve as a valuable tool to understand the impact of SV on disease development and progression.
|
Preprint |
2 |
1 |
7
|
Armstrong ND, Srinivasasainagendra V, Ammous F, Assimes TL, Beitelshees AL, Brody J, Cade BE, Ida Chen YD, Chen H, de Vries PS, Floyd JS, Franceschini N, Guo X, Hellwege JN, House JS, Hwu CM, Kardia SLR, Lange EM, Lange LA, McDonough CW, Montasser ME, O’Connell JR, Shuey MM, Sun X, Tanner RM, Wang Z, Zhao W, Carson AP, Edwards TL, Kelly TN, Kenny EE, Kooperberg C, Loos RJF, Morrison AC, Motsinger-Reif A, Psaty BM, Rao DC, Redline S, Rich SS, Rotter JI, Smith JA, Smith AV, Irvin MR, Arnett DK. Whole genome sequence analysis of apparent treatment resistant hypertension status in participants from the Trans-Omics for Precision Medicine program. Front Genet 2023; 14:1278215. [PMID: 38162683 PMCID: PMC10755672 DOI: 10.3389/fgene.2023.1278215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Accepted: 11/24/2023] [Indexed: 01/03/2024]
Abstract
Introduction: Apparent treatment-resistant hypertension (aTRH) is characterized by the use of four or more antihypertensive (AHT) classes to achieve blood pressure (BP) control. In the current study, we conducted single-variant and gene-based analyses of aTRH among individuals from 12 Trans-Omics for Precision Medicine cohorts with whole-genome sequencing data. Methods: Cases were defined as individuals treated for hypertension (HTN) taking three different AHT classes, with average systolic BP ≥ 140 or diastolic BP ≥ 90 mmHg, or four or more medications regardless of BP (n = 1,705). A normotensive control group was defined as individuals with BP < 140/90 mmHg (n = 22,079), not on AHT medication. A second control group comprised individuals who were treatment responsive on one AHT medication with BP < 140/90 mmHg (n = 5,424). Logistic regression with kinship adjustment using the Scalable and Accurate Implementation of Generalized mixed models (SAIGE) was performed, adjusting for age, sex, and genetic ancestry. We assessed variants using SKAT-O in rare-variant analyses. Single-variant and gene-based tests were conducted in a pooled multi-ethnicity stratum, as well as in self-reported ethnic/racial strata (European and African American). Results: One variant in the known HTN locus, KCNK3, was a top finding in the multi-ethnic analysis (p = 8.23E-07) for the normotensive control group [rs12476527, odds ratio (95% confidence interval) = 0.80 (0.74-0.88)]. This variant was replicated in the Vanderbilt University Medical Center's DNA repository data. Aggregate gene-based signals included the genes AGTPBP, MYL4, PDCD4, BBS9, ERG, and IER3. Discussion: Additional work validating these loci in larger, more diverse populations is warranted to determine whether these regions influence the pathobiology of aTRH.
|
research-article |
2 |
|
8
|
Einson J, Glinos D, Boerwinkle E, Castaldi P, Darbar D, de Andrade M, Ellinor P, Fornage M, Gabriel S, Germer S, Gibbs R, Hersh CP, Johnsen J, Kaplan R, Konkle BA, Kooperberg C, Nassir R, Loos RJF, Meyers DA, Mitchell BD, Psaty B, Vasan RS, Rich SS, Rienstra M, Rotter JI, Saferali A, Shoemaker MB, Silverman E, Smith AV, Mohammadi P, Castel SE, Iossifov I, Lappalainen T. Genetic control of mRNA splicing as a potential mechanism for incomplete penetrance of rare coding variants. Genetics 2023; 224:iyad115. [PMID: 37348055 PMCID: PMC10411602 DOI: 10.1093/genetics/iyad115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Revised: 02/02/2023] [Accepted: 04/18/2023] [Indexed: 06/24/2023]
Abstract
Exonic variants present some of the strongest links between genotype and phenotype. However, these variants can have significant inter-individual pathogenicity differences, known as variable penetrance. In this study, we propose a model where genetically controlled mRNA splicing modulates the pathogenicity of exonic variants. By first cataloging exonic inclusion from RNA-sequencing data in GTEx V8, we find that pathogenic alleles are depleted on highly included exons. Using large-scale phased whole-genome sequencing data from the TOPMed consortium, we observe that this effect may be driven by common splice-regulatory genetic variants, and that natural selection acts on haplotype configurations that reduce the transcript inclusion of putatively pathogenic variants, especially when limiting to haploinsufficient genes. Finally, we test if this effect may be relevant for autism risk using families from the Simons Simplex Collection, but find that splicing of pathogenic alleles has a penetrance-reducing effect here as well. Overall, our results indicate that common splice-regulatory variants may play a role in reducing the damaging effects of rare exonic variants.
|
Research Support, N.I.H., Extramural |
2 |
|
9
|
Browning BL, Browning SR. Genotype error biases trio-based estimates of haplotype phase accuracy. Am J Hum Genet 2022; 109:1016-1025. [PMID: 35659928 PMCID: PMC9247820 DOI: 10.1016/j.ajhg.2022.04.019] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2022] [Accepted: 04/29/2022] [Indexed: 11/01/2022]
Abstract
Haplotypes can be estimated from unphased genotype data via statistical methods. When parent-offspring trios are available for inferring the true phase from Mendelian inheritance rules, the accuracy of statistical phasing is usually measured by the switch error rate, which is the proportion of pairs of consecutive heterozygotes that are incorrectly phased. We present a method for estimating the genotype error rate from parent-offspring trios and a method for estimating the bias that occurs in the observed switch error rate as a result of genotype error. We apply these methods to 485,301 genotyped UK Biobank samples that include 898 White British trios and to 38,387 sequenced TOPMed samples that include 217 African Caribbean trios and 669 European American trios. We show that genotype error inflates the observed switch error rate and that the relative bias increases with sample size. For the UK Biobank White British trios, the observed switch error rate in the trio offspring is 2.4 times larger than the estimated true switch error rate (1.4 × 10⁻³ vs 5.8 × 10⁻⁴). We propose an alternate definition of phase error that counts two consecutive switch errors as a single error because back-to-back switch errors arise when a single heterozygote is incorrectly phased with respect to the surrounding heterozygotes. With this definition, we estimate that the average distance between phase errors is 64 megabases in the UK Biobank White British individuals.
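The two error definitions in the abstract above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the 0/1 encoding of each heterozygote's phase, the function names, and the toy data are all assumptions made for illustration, and the sketch ignores genotype error, which is the paper's actual subject.

```python
from typing import List


def switch_positions(true_phase: List[int], est_phase: List[int]) -> List[int]:
    """Indices i where heterozygotes i-1 and i are phased incorrectly
    relative to each other (a switch error).

    Assumed encoding: each list entry is 0 or 1 and says which of the two
    haplotypes carries the alternate allele at that heterozygous site.
    """
    if len(true_phase) != len(est_phase):
        raise ValueError("phase vectors must have the same length")
    # agree[i] is True when the estimated orientation matches the truth.
    agree = [t == e for t, e in zip(true_phase, est_phase)]
    # A switch occurs wherever agreement flips between consecutive sites.
    return [i for i in range(1, len(agree)) if agree[i] != agree[i - 1]]


def switch_error_rate(true_phase: List[int], est_phase: List[int]) -> float:
    """Proportion of consecutive heterozygote pairs that are misphased."""
    n_pairs = len(true_phase) - 1
    if n_pairs < 1:
        return 0.0
    return len(switch_positions(true_phase, est_phase)) / n_pairs


def phase_error_count(true_phase: List[int], est_phase: List[int]) -> int:
    """Alternate definition: two back-to-back switch errors count as one
    error, since they come from a single misphased heterozygote."""
    switches = switch_positions(true_phase, est_phase)
    count = 0
    i = 0
    while i < len(switches):
        count += 1
        # Merge an immediately following switch into the same error.
        if i + 1 < len(switches) and switches[i + 1] == switches[i] + 1:
            i += 2
        else:
            i += 1
    return count


# Toy example (hypothetical data): site 2 alone is flipped, and the
# haplotype tail is switched starting at site 5.
truth = [0, 0, 0, 0, 0, 0]
estimate = [0, 0, 1, 0, 0, 1]
n_switches = len(switch_positions(truth, estimate))
rate = switch_error_rate(truth, estimate)
n_phase_errors = phase_error_count(truth, estimate)
```

In the toy data, the single flipped heterozygote at site 2 produces two adjacent switch errors, which the alternate definition merges into one phase error; the switch at site 5 remains a separate error.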
|
Research Support, N.I.H., Extramural |
3 |
|