1
Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. Am J Hum Genet 2024; 111:691-700. [PMID: 38513668 PMCID: PMC11023918 DOI: 10.1016/j.ajhg.2024.02.015] [Received: 11/06/2023] [Revised: 02/26/2024] [Accepted: 02/27/2024] [Indexed: 03/23/2024]
Abstract
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more computationally efficient inference of identity by descent (IBD) than approaches that infer pairwise IBD segments and provides locus-specific IBD clusters rather than IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset. We also present a method for using multi-individual IBD to detect alleles changed by gene conversion. Application of our methods to the autosomal sequence data for 125,361 White British individuals in the UK Biobank detects more than 9 million converted alleles. This is 2,900 times more alleles changed by gene conversion than were detected in a previous analysis of familial data. We estimate that more than 250,000 sequenced probands and a much larger number of additional genomes from multi-generational family members would be required to find a similar number of alleles changed by gene conversion using a family-based approach. Our IBD clustering method is implemented in the open-source ibd-cluster software package.
Affiliation(s)
- Sharon R Browning
- Department of Biostatistics, University of Washington, Seattle, WA, USA.
- Brian L Browning
- Department of Biostatistics, University of Washington, Seattle, WA, USA; Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA, USA.
2
Chen H, Naseri A, Zhi D. FiMAP: A fast identity-by-descent mapping test for biobank-scale cohorts. PLoS Genet 2023; 19:e1011057. [PMID: 38039339 PMCID: PMC10718418 DOI: 10.1371/journal.pgen.1011057] [Received: 03/10/2023] [Revised: 12/13/2023] [Accepted: 11/07/2023] [Indexed: 12/03/2023]
Abstract
Although genome-wide association studies (GWAS) have identified tens of thousands of genetic loci, the genetic architecture is still not fully understood for many complex traits. Most GWAS and sequencing association studies have focused on single nucleotide polymorphisms or copy number variations, including common and rare genetic variants. However, phased haplotype information is often ignored in GWAS or variant set tests for rare variants. Here we leverage the identity-by-descent (IBD) segments inferred from a random projection-based IBD detection algorithm in the mapping of genetic associations with complex traits, to develop a computationally efficient statistical test for IBD mapping in biobank-scale cohorts. We used sparse linear algebra and random matrix algorithms to speed up the computation, and a genome-wide IBD mapping scan of more than 400,000 samples finished within a few hours. Simulation studies showed that our new method had well-controlled type I error rates under the null hypothesis of no genetic association in large biobank-scale cohorts, and outperformed traditional GWAS single-variant tests when the causal variants were untyped and rare, or in the presence of haplotype effects. We also applied our method to IBD mapping of six anthropometric traits using the UK Biobank data and identified a total of 3,442 associations, 2,131 (62%) of which remained significant after conditioning on suggestive tag variants in the ± 3 centimorgan flanking regions from GWAS.
Affiliation(s)
- Han Chen
- Human Genetics Center, Department of Epidemiology, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
- Ardalan Naseri
- Center for Artificial Intelligence and Genome Informatics, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
- Degui Zhi
- Center for Artificial Intelligence and Genome Informatics, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, United States of America
3
Browning SR, Browning BL. Biobank-scale inference of multi-individual identity by descent and gene conversion. bioRxiv 2023:2023.11.03.565574. [PMID: 37961601 PMCID: PMC10635131 DOI: 10.1101/2023.11.03.565574] [Indexed: 11/15/2023]
Abstract
We present a method for efficiently identifying clusters of identical-by-descent haplotypes in biobank-scale sequence data. Our multi-individual approach enables much more efficient collection and storage of identity by descent (IBD) information than approaches that detect and store pairwise IBD segments. Our method's computation time, memory requirements, and output size scale linearly with the number of individuals in the dataset. We also present a method for using multi-individual IBD to detect alleles changed by gene conversion. Application of our methods to the autosomal sequence data for 125,361 White British individuals in the UK Biobank detects more than 9 million converted alleles. This is 2900 times more alleles changed by gene conversion than were detected in a previous analysis of familial data. We estimate that more than 250,000 sequenced probands and a much larger number of additional genomes from multi-generational family members would be required to find a similar number of alleles changed by gene conversion using a family-based approach.
Affiliation(s)
- Brian L. Browning
- Department of Biostatistics, University of Washington, Seattle, WA
- Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, WA
4
Shemirani R, Belbin GM, Burghardt K, Lerman K, Avery CL, Kenny EE, Gignoux CR, Ambite JL. Selecting Clustering Algorithms for Identity-By-Descent Mapping. Pacific Symposium on Biocomputing 2023; 28:121-132. [PMID: 36540970 PMCID: PMC9782725] [Indexed: 12/25/2022]
Abstract
Groups of distantly related individuals who share a short segment of their genome identical-by-descent (IBD) can provide insights about rare traits and diseases in massive biobanks using IBD mapping. Clustering algorithms play an important role in finding these groups accurately and at scale. We set out to analyze the fitness of commonly used, fast and scalable clustering algorithms for IBD mapping applications. We designed a realistic benchmark for local IBD graphs and utilized it to compare the statistical power of clustering algorithms via simulating 2.3 million clusters across 850 experiments. We found Infomap and Markov Clustering (MCL) community detection methods to have high statistical power in most of the scenarios. They yield a 30% increase in power compared to the current state-of-the-art approach, with a runtime three orders of magnitude lower. We also found that standard clustering metrics, such as modularity, cannot predict statistical power of algorithms in IBD mapping applications. We extend our findings to real datasets by analyzing the Population Architecture using Genomics and Epidemiology (PAGE) Study dataset with 51,000 samples and 2 million shared segments on Chromosome 1, resulting in the extraction of 39 million local IBD clusters. We demonstrate the power of our approach by recovering signals of rare genetic variation in the Whole-Exome Sequence data of 200,000 individuals in the UK Biobank. We provide an efficient implementation to enable clustering at scale for IBD mapping for various populations and scenarios. Supplementary Information: The code, along with supplementary methods and figures, is available at https://github.com/roohy/localIBDClustering.
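To make the idea of a "local IBD graph" concrete, here is a minimal sketch, not the authors' implementation. It assumes pairwise IBD segments are given as hypothetical tuples `(hap_a, hap_b, start, end)` and, as a deliberately simple stand-in for the Infomap and MCL methods evaluated in the paper, groups haplotypes at one position by connected components using union-find. The function name `ibd_clusters_at_locus` and the input layout are illustrative assumptions.

```python
def ibd_clusters_at_locus(segments, position):
    """Group haplotypes into clusters at a single genomic position.

    segments: iterable of (hap_a, hap_b, start, end) pairwise IBD segments
              (hypothetical input layout, coordinates inclusive).
    Returns a sorted list of sorted clusters (connected components of the
    graph whose edges are the segments covering `position`).
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for hap_a, hap_b, start, end in segments:
        if start <= position <= end:  # segment covers the locus
            root_a, root_b = find(hap_a), find(hap_b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = {}
    for hap in list(parent):
        clusters.setdefault(find(hap), []).append(hap)
    return sorted(sorted(members) for members in clusters.values())


# Toy data: four pairwise segments over five haplotypes.
toy_segments = [
    ("h1", "h2", 100, 500),
    ("h2", "h3", 300, 900),
    ("h4", "h5", 50, 200),
    ("h1", "h5", 600, 800),
]
print(ibd_clusters_at_locus(toy_segments, 350))
```

Connected components merge any haplotypes linked by a chain of segments, so a single false-positive edge can fuse two unrelated groups; community detection methods such as those compared in the abstract are designed to be more robust to that.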
Affiliation(s)
- Ruhollah Shemirani
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
5
Ryan N, Ormond C, Chang YC, Contreras J, Raventos H, Gill M, Heron E, Mathews CA, Corvin A. Identity-by-descent analysis of a large Tourette's syndrome pedigree from Costa Rica implicates genes involved in neuronal development and signal transduction. Mol Psychiatry 2022; 27:5020-5027. [PMID: 36224258 PMCID: PMC9763103 DOI: 10.1038/s41380-022-01771-9] [Received: 09/23/2021] [Revised: 05/13/2022] [Accepted: 08/30/2022] [Indexed: 01/14/2023]
Abstract
Tourette Syndrome (TS) is a heritable, early-onset neuropsychiatric disorder that typically begins in early childhood. Identifying rare genetic variants that make a significant contribution to risk in affected families may provide important insights into the molecular aetiology of this complex and heterogeneous syndrome. Here we present a whole-genome sequencing (WGS) analysis from the 11-generation pedigree (>500 individuals) of a densely affected Costa Rican family which shares ancestry from six founder pairs. By conducting an identity-by-descent (IBD) analysis using WGS data from 19 individuals from the extended pedigree we have identified putative risk haplotypes that were not seen in controls, and can be linked with four of the six founder pairs. Rare coding and non-coding variants present on the haplotypes and only seen in haplotype carriers show an enrichment in pathways such as regulation of locomotion and signal transduction, suggesting common mechanisms by which the haplotype-specific variants may be contributing to TS-risk in this pedigree. In particular we have identified a rare deleterious missense variation in RAPGEF1 on a chromosome 9 haplotype and two ultra-rare deleterious intronic variants in ERBB4 and IKZF2 on the same chromosome 2 haplotype. All three genes play a role in neurodevelopment. This study, using WGS data in a pedigree-based approach, shows the importance of investigating both coding and non-coding variants to identify genes that may contribute to disease risk. Together, the genes and variants identified on the IBD haplotypes represent biologically relevant targets for investigation in other pedigree and population-based TS data.
Affiliation(s)
- Niamh Ryan
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College Dublin, Dublin, Ireland
- Cathal Ormond
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College Dublin, Dublin, Ireland
- Yi-Chieh Chang
- Department of Psychiatry, Center for OCD, Anxiety, and Related Disorders, University of Florida, Gainesville, FL, USA
- Javier Contreras
- Centro de Investigación en Biología Celular y Molecular, Universidad de Costa Rica, San José, Costa Rica
- Henriette Raventos
- Centro de Investigación en Biología Celular y Molecular, Universidad de Costa Rica, San José, Costa Rica
- School of Biology, Universidad de Costa Rica, San José, Costa Rica
- Michael Gill
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College Dublin, Dublin, Ireland
- Elizabeth Heron
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College Dublin, Dublin, Ireland
- Carol A Mathews
- Department of Psychiatry, Center for OCD, Anxiety, and Related Disorders, University of Florida, Gainesville, FL, USA
- University of Florida Genetics Institute, University of Florida, Gainesville, FL, USA
- Aiden Corvin
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College Dublin, Dublin, Ireland
6
Yue W, Naseri A, Wang V, Shakya P, Zhang S, Zhi D. P-smoother: efficient PBWT smoothing of large haplotype panels. Bioinformatics Advances 2022; 2:vbac045. [PMID: 35785021 PMCID: PMC9245627 DOI: 10.1093/bioadv/vbac045] [Received: 03/22/2022] [Revised: 05/03/2022] [Accepted: 06/15/2022] [Indexed: 01/27/2023]
Abstract
Motivation: As large haplotype panels become increasingly available, efficient string matching algorithms such as positional Burrows-Wheeler transformation (PBWT) are promising for identifying shared haplotypes. However, recent mutations and genotyping errors create occasional mismatches, presenting challenges for exact haplotype matching. Previous solutions are based on probabilistic models or seed-and-extension algorithms that passively tolerate mismatches.
Results: Here, we propose a PBWT-based smoothing algorithm, P-smoother, to actively 'correct' these mismatches and thus 'smooth' the panel. P-smoother runs a bidirectional PBWT-based panel scanning that flips mismatching alleles based on the overall haplotype matching context, which we call the IBD (identical-by-descent) prior. In a simulated panel with 4000 haplotypes and a 0.2% error rate, we show it can reliably correct 85% of errors. As a result, PBWT algorithms running over the smoothed panel can identify more pairwise IBD segments than that over the unsmoothed panel. Most strikingly, a PBWT-cluster algorithm running over the smoothed panel, which we call PS-cluster, achieves state-of-the-art performance for identifying multiway IBD segments, a challenging problem in the computational community for years. We also showed that PS-cluster is adequately efficient for UK Biobank data. Therefore, P-smoother opens up new possibilities for efficient error-tolerating algorithms for biobank-scale haplotype panels.
Availability and implementation: Source code is available at github.com/ZhiGroup/P-smoother.
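For readers unfamiliar with PBWT, the sketch below illustrates only the core ordering step that PBWT-based methods build on: after each site, haplotypes are kept sorted by their reversed prefixes, so haplotypes that match over a long stretch ending at that site sit next to each other. It is a generic illustration under simplifying assumptions (biallelic 0/1 alleles, a small in-memory panel), not P-smoother's smoothing algorithm, and the function name `pbwt_prefix_orders` is hypothetical.

```python
def pbwt_prefix_orders(haps):
    """Return the haplotype ordering after each site of a 0/1 panel.

    haps: list of equal-length lists of 0/1 alleles (one list per haplotype).
    orders[k] is the order of haplotype indices sorted by the reversed
    prefix haps[i][:k]; orders[0] is the input order.
    """
    n_sites = len(haps[0]) if haps else 0
    order = list(range(len(haps)))
    orders = [order]
    for k in range(n_sites):
        # Stable partition by the allele at site k: haplotypes carrying 0
        # come first, and the previous relative order is kept in each group.
        zeros = [i for i in order if haps[i][k] == 0]
        ones = [i for i in order if haps[i][k] == 1]
        order = zeros + ones
        orders.append(order)
    return orders


# Toy panel: four haplotypes, three sites.
panel = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 0],
]
print(pbwt_prefix_orders(panel)[-1])
```

Because each step is a stable partition of the previous order, building all orderings takes time linear in the panel size, which is what makes this family of algorithms attractive at biobank scale.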
Affiliation(s)
- William Yue
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Ardalan Naseri
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Victor Wang
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Pramesh Shakya
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Shaojie Zhang
- Department of Computer Science, University of Central Florida, Orlando, FL 32816, USA
- Degui Zhi
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
7
Harold D, Connolly S, Riley BP, Kendler KS, McCarthy SE, McCombie WR, Richards A, Owen MJ, O'Donovan MC, Walters J, Donohoe G, Gill M, Corvin A, Morris DW. Population-based identity-by-descent mapping combined with exome sequencing to detect rare risk variants for schizophrenia. Am J Med Genet B Neuropsychiatr Genet 2019; 180:223-231. [PMID: 30801977 PMCID: PMC8863274 DOI: 10.1002/ajmg.b.32716] [Received: 07/16/2018] [Revised: 10/22/2018] [Accepted: 12/03/2018] [Indexed: 12/30/2022]
Abstract
Genome-wide association studies (GWASs) are highly effective at identifying common risk variants for schizophrenia. Rare risk variants are also important contributors to schizophrenia etiology but, with the exception of large copy number variants, are difficult to detect with GWAS. Exome and genome sequencing, which have accelerated the study of rare variants, are expensive so alternative methods are needed to aid detection of rare variants. Here we re-analyze an Irish schizophrenia GWAS dataset (n = 3,473) by performing identity-by-descent (IBD) mapping followed by exome sequencing of individuals identified as sharing risk haplotypes to search for rare risk variants in coding regions. We identified 45 rare haplotypes (>1 cM) that were significantly more common in cases than controls. By exome sequencing 105 haplotype carriers, we investigated these haplotypes for functional coding variants that could be tested for association in independent GWAS samples. We identified one rare missense variant in PCNT but did not find statistical support for an association with schizophrenia in a replication analysis. However, IBD mapping can prioritize both individual samples and genomic regions for follow-up analysis but genome rather than exome sequencing may be more effective at detecting risk variants on rare haplotypes.
Affiliation(s)
- Denise Harold
- Neuropsychiatric Genetics Research Group, Institute of Molecular Medicine and Discipline of Psychiatry, Trinity College Dublin, Dublin, Ireland
- School of Biotechnology, Dublin City University, Dublin, Ireland
- Siobhan Connolly
- Neuropsychiatric Genetics Research Group, Institute of Molecular Medicine and Discipline of Psychiatry, Trinity College Dublin, Dublin, Ireland
- Brien P Riley
- Departments of Psychiatry and Human Genetics, Virginia Institute of Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, Virginia
- Kenneth S Kendler
- Departments of Psychiatry and Human Genetics, Virginia Institute of Psychiatric and Behavioral Genetics, Virginia Commonwealth University, Richmond, Virginia
- Shane E McCarthy
- The Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York
- William R McCombie
- The Stanley Institute for Cognitive Genomics, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York
- Alex Richards
- MRC Centre for Neuropsychiatric Genetics and Genomics, Cardiff University School of Medicine, Cardiff, United Kingdom
- Michael J Owen
- MRC Centre for Neuropsychiatric Genetics and Genomics, Cardiff University School of Medicine, Cardiff, United Kingdom
- Michael C O'Donovan
- MRC Centre for Neuropsychiatric Genetics and Genomics, Cardiff University School of Medicine, Cardiff, United Kingdom
- James Walters
- MRC Centre for Neuropsychiatric Genetics and Genomics, Cardiff University School of Medicine, Cardiff, United Kingdom
- Gary Donohoe
- Cognitive Genetics and Cognitive Therapy Group, Neuroimaging, Cognition & Genomics (NICOG) Centre & NCBES Galway Neuroscience Centre, School of Psychology and Discipline of Biochemistry, National University of Ireland Galway, Galway, Ireland
- Michael Gill
- Neuropsychiatric Genetics Research Group, Institute of Molecular Medicine and Discipline of Psychiatry, Trinity College Dublin, Dublin, Ireland
- Aiden Corvin
- Neuropsychiatric Genetics Research Group, Institute of Molecular Medicine and Discipline of Psychiatry, Trinity College Dublin, Dublin, Ireland
- Derek W Morris
- Cognitive Genetics and Cognitive Therapy Group, Neuroimaging, Cognition & Genomics (NICOG) Centre & NCBES Galway Neuroscience Centre, School of Psychology and Discipline of Biochemistry, National University of Ireland Galway, Galway, Ireland
8
Park DS, Baran Y, Hormozdiari F, Eng C, Torgerson DG, Burchard EG, Zaitlen N. PIGS: improved estimates of identity-by-descent probabilities by probabilistic IBD graph sampling. BMC Bioinformatics 2015; 16 Suppl 5:S9. [PMID: 25860540 PMCID: PMC4402697 DOI: 10.1186/1471-2105-16-s5-s9] [Indexed: 12/19/2022]
Abstract
Identifying segments in the genome of different individuals that are identical-by-descent (IBD) is a fundamental element of genetics. IBD data is used for numerous applications including demographic inference, heritability estimation, and mapping disease loci. Simultaneous detection of IBD over multiple haplotypes has proven to be computationally difficult. To overcome this, many state-of-the-art methods estimate the probability of IBD between each pair of haplotypes separately. While computationally efficient, these methods fail to leverage the clique structure of IBD, resulting in less powerful IBD identification, especially for small IBD segments. We develop a hybrid approach (PIGS), which combines the computational efficiency of pairwise methods with the power of multiway methods. It leverages the IBD graph structure to compute the probability of IBD conditional on all pairwise estimates simultaneously. We show via extensive simulations and analysis of real data that our method produces a substantial increase in the number of identified small IBD segments.