1
Naseri A, Zhi D, Zhang S. Discovery of runs-of-homozygosity diplotype clusters and their associations with diseases in UK Biobank. eLife 2024; 13:e81698. [PMID: 38905121 PMCID: PMC11249732 DOI: 10.7554/elife.81698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2022] [Accepted: 06/20/2024] [Indexed: 06/23/2024]
Abstract
Runs-of-homozygosity (ROH) segments, contiguous homozygous regions in a genome, were traditionally linked to families and inbred populations. However, a growing literature suggests that ROHs are ubiquitous in outbred populations. Still, most existing genetic studies of ROH in populations are limited to aggregated ROH content across the genome, which does not offer the resolution for mapping causal loci. This limitation is mainly due to a lack of methods for the efficient identification of shared ROH diplotypes. Here, we present a new method, ROH-DICE (runs-of-homozygous diplotype cluster enumerator), to find large ROH diplotype clusters, that is, sufficiently long ROHs shared by a sufficient number of individuals, in large cohorts. ROH-DICE identified over 1 million ROH diplotypes that span over 100 single nucleotide polymorphisms (SNPs) and are shared by more than 100 UK Biobank participants. Moreover, we found significant associations of clustered ROH diplotypes across the genome with various self-reported diseases, with the strongest associations found between the extended human leukocyte antigen (HLA) region and autoimmune disorders. We found an association between a diplotype covering the homeostatic iron regulator (HFE) gene and hemochromatosis, even though the well-known causal SNP was not directly genotyped or imputed. Using a genome-wide scan, we identified a putative association between carriers of an ROH diplotype in chromosome 4 and an increase in mortality among COVID-19 patients (p-value = 1.82 × 10⁻¹¹). In summary, our ROH-DICE method, by calling out large ROH diplotypes in a large outbred population, enables further population genetics research into the demographic history of large populations. More importantly, our method enables a new genome-wide mapping approach for finding disease-causing loci with multi-marker recessive effects at a population scale.
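To make the notion of an ROH diplotype cluster concrete, the following is a minimal Python sketch of a naive check over one fixed window. It is not ROH-DICE itself: the function name, the 0/1/2 genotype coding (alternate-allele count), and the thresholds are assumptions made for illustration, and scanning every window this way would not scale to biobank-sized cohorts.

```python
from collections import defaultdict


def roh_clusters_in_window(genotypes, start, end, min_samples):
    """Group samples that are homozygous at every site in [start, end)
    and carry the same diplotype there.

    genotypes: dict mapping sample id -> list of alternate-allele counts
               (0 or 2 = homozygous, 1 = heterozygous); hypothetical coding.
    Returns a dict mapping each shared diplotype (a tuple) to the sorted
    list of samples carrying it, keeping only groups with at least
    min_samples members.
    """
    groups = defaultdict(list)
    for sample, calls in genotypes.items():
        window = tuple(calls[start:end])
        # A sample qualifies only if no site in the window is heterozygous.
        if all(g in (0, 2) for g in window):
            groups[window].append(sample)
    return {d: sorted(s) for d, s in groups.items() if len(s) >= min_samples}


# Toy data (made up for illustration).
genotypes = {
    "s1": [0, 2, 2, 0, 1],
    "s2": [0, 2, 2, 0, 0],
    "s3": [1, 2, 2, 0, 2],
    "s4": [0, 2, 2, 0, 2],
}
clusters = roh_clusters_in_window(genotypes, 0, 4, min_samples=2)
print(clusters)
```

In this toy input, three samples share the same homozygous diplotype over the first four sites, while extending the window by one site breaks the cluster apart.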
Affiliation(s)
- Ardalan Naseri
- Department of Computer Science, University of Central Florida, Orlando, United States
- Degui Zhi
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, United States
- Shaojie Zhang
- Department of Computer Science, University of Central Florida, Orlando, United States
2
Cozzi D, Rossi M, Rubinacci S, Gagie T, Köppl D, Boucher C, Bonizzoni P. μ-PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data. Bioinformatics 2023; 39:btad552. [PMID: 37688560 PMCID: PMC10502237 DOI: 10.1093/bioinformatics/btad552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Revised: 07/07/2023] [Accepted: 09/07/2023] [Indexed: 09/11/2023]
Abstract
MOTIVATION: The Positional Burrows-Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory. RESULTS: In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal match (SMEM) queries in haplotype sequences. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets from the 1000 Genomes Project and UK Biobank. Our experiments demonstrate that the μ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. μ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel. AVAILABILITY AND IMPLEMENTATION: Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.
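The idea of run-length encoding PBWT columns can be illustrated with a small Python sketch. This is not the μ-PBWT implementation: the function names are hypothetical, alleles are assumed to be binary (0/1), and the succinct r-index structures described in the abstract are not reproduced. The sketch only shows that, once haplotypes are ordered by their reversed prefixes, each column tends to fall into a few runs.

```python
from itertools import groupby


def pbwt_columns(haplotypes):
    """Yield, for each site k, the alleles at k listed in the order of the
    haplotypes sorted by their reversed prefixes before k.

    haplotypes: list of equal-length lists of 0/1 alleles (assumed binary).
    """
    order = list(range(len(haplotypes)))
    for k in range(len(haplotypes[0])):
        yield [haplotypes[i][k] for i in order]
        # Stable partition by the allele at site k gives the order for k + 1.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] != 0]
        order = zeros + ones


def run_length_encode(column):
    """Collapse a column into (allele, run length) pairs."""
    return [(allele, sum(1 for _ in group)) for allele, group in groupby(column)]


# Toy panel (made up for illustration).
panel = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]
encoded = [run_length_encode(col) for col in pbwt_columns(panel)]
print(encoded)
```

On real panels, neighbouring haplotypes in the sorted order share long recent history, so the number of runs is typically far smaller than the number of haplotypes; a run-length-based index exploits exactly that.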
Affiliation(s)
- Davide Cozzi
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan 20126, Italy
- Massimiliano Rossi
- Department of Computer & Information Science & Engineering, Herbert-Wertheim College of Engineering, University of Florida, Gainesville, Florida 32611, United States
- Simone Rubinacci
- Department of Computational Biology, University of Lausanne, Lausanne 1015, Switzerland
- Travis Gagie
- Faculty of Computer Science, Dalhousie University, Halifax B3H 4R2, Canada
- Dominik Köppl
- M&D Data Science Center, Tokyo Medical and Dental University, Tokyo 113-8510, Japan
- Department of Computer Science, University of Muenster, Muenster 48149, Germany
- Christina Boucher
- Department of Computer & Information Science & Engineering, Herbert-Wertheim College of Engineering, University of Florida, Gainesville, Florida 32611, United States
- Paola Bonizzoni
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan 20126, Italy
3
Naseri A, Yue W, Zhang S, Zhi D. Fast inference of genetic recombination rates in biobank scale data. Genome Res 2023; 33:1015-1022. [PMID: 37349109 PMCID: PMC10538484 DOI: 10.1101/gr.277676.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2023] [Accepted: 06/09/2023] [Indexed: 06/24/2023]
Abstract
Although rates of recombination events across the genome (genetic maps) are fundamental to genetic research, the majority of current studies only use one standard map. There is evidence suggesting population differences in genetic maps, and thus estimating population-specific maps is of interest. Although the recent availability of biobank-scale data offers such opportunities, current methods are not efficient at leveraging very large sample sizes. The most accurate methods are still linkage disequilibrium (LD)-based methods that are only tractable for a few hundred samples. In this work, we propose a fast and memory-efficient method for estimating genetic maps from population genotyping data. Our method, FastRecomb, leverages the efficient positional Burrows-Wheeler transform (PBWT) data structure for counting IBD segment boundaries as potential recombination events. We used PBWT blocks to avoid redundant counting of pairwise matches. Moreover, we used a panel-smoothing technique to reduce the noise from errors and recent mutations. Using simulation, we found that FastRecomb achieves state-of-the-art performance at 10-kb resolution, in terms of correlation coefficients between the estimated map and the ground truth. This is mainly because FastRecomb can effectively take advantage of large panels comprising more than hundreds of thousands of haplotypes. At the same time, other methods lack the efficiency to handle such data. We believe further refinement of FastRecomb would deliver more accurate genetic maps for the genetics community.
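The core counting idea, treating IBD segment boundaries as potential recombination events, can be sketched in a few lines of Python. This is an illustrative simplification and not FastRecomb: the function name and the input format (a plain list of segments given as site-index pairs) are assumptions, and the PBWT blocks and panel smoothing mentioned in the abstract are not reproduced.

```python
def boundary_rates(segments, num_sites):
    """Estimate relative recombination rates between adjacent sites.

    segments: list of (start, end) site indices of IBD segments, with end
              exclusive; a hypothetical input format for this sketch.
    num_sites: total number of sites in the panel.
    Returns a list of length num_sites - 1 whose entry j is the fraction of
    all counted boundaries that fall between site j and site j + 1.
    """
    counts = [0] * (num_sites - 1)
    for start, end in segments:
        if start > 0:
            counts[start - 1] += 1  # boundary between site start-1 and start
        if end < num_sites:
            counts[end - 1] += 1    # boundary between site end-1 and end
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


# Toy segments (made up for illustration).
rates = boundary_rates([(0, 3), (1, 3), (2, 5)], num_sites=5)
print(rates)
```

Segment ends that coincide with the first or last site are ignored, since they reflect the edge of the data rather than a recombination event.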
Affiliation(s)
- Ardalan Naseri
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, Texas 77030, USA
- William Yue
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, Texas 77030, USA
- Shaojie Zhang
- Department of Computer Science, University of Central Florida, Orlando, Florida 32816, USA
- Degui Zhi
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, Texas 77030, USA
4
Sanaullah A, Zhi D, Zhang S. d-PBWT: dynamic positional Burrows-Wheeler transform. Bioinformatics 2021; 37:2390-2397. [PMID: 33624749 DOI: 10.1093/bioinformatics/btab117] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 04/15/2020] [Revised: 01/31/2021] [Accepted: 02/23/2021] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Durbin's positional Burrows-Wheeler transform (PBWT) is a scalable data structure for haplotype matching. It has been successfully applied to identical by descent (IBD) segment identification and genotype imputation. Once the PBWT of a haplotype panel is constructed, it supports efficient retrieval of all shared long segments among all individuals (long matches) and efficient query between an external haplotype and the panel. However, the standard PBWT is an array-based static data structure and does not support dynamic updates of the panel. RESULTS: Here, we generalize the static PBWT to a dynamic data structure, d-PBWT, where the reverse prefix sorting at each position is stored with linked lists. We also developed efficient algorithms for insertion and deletion of individual haplotypes. In addition, we verified that d-PBWT can support all algorithms of PBWT. In doing so, we systematically investigated variations of set maximal match and long match query algorithms: while they all have average case time complexity independent of database size, they have different worst case complexities and dependencies on additional data structures. AVAILABILITY: The benchmarking code is available at genome.ucf.edu/d-PBWT. SUPPLEMENTARY INFORMATION: Supplementary Materials are available at Bioinformatics online.
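For orientation, the static, array-based structure that d-PBWT generalizes can be sketched as follows. This Python sketch follows the standard construction of the positional prefix array and divergence array for binary alleles; it does not show the linked-list representation or the insertion and deletion algorithms of d-PBWT, and the function and variable names are chosen here for illustration.

```python
def build_pbwt(haplotypes):
    """Build the final positional prefix and divergence arrays.

    haplotypes: list of equal-length lists of 0/1 alleles (assumed binary).
    Returns (a, d) after processing all sites:
      a[i] is the index of the i-th haplotype when sorted by reversed prefix;
      d[i] is the first site of the longest match, ending at the last site,
           between haplotype a[i] and its predecessor a[i - 1]
           (equal to the number of sites when there is no match).
    """
    m = len(haplotypes)
    n = len(haplotypes[0])
    a = list(range(m))
    d = [0] * m
    for k in range(n):
        a0, a1, d0, d1 = [], [], [], []
        p = q = k + 1  # match starts for the next 0-allele / 1-allele entry
        for i, start in zip(a, d):
            p = max(p, start)
            q = max(q, start)
            if haplotypes[i][k] == 0:
                a0.append(i)
                d0.append(p)
                p = 0
            else:
                a1.append(i)
                d1.append(q)
                q = 0
        # Stable partition: all 0-allele haplotypes precede 1-allele ones.
        a = a0 + a1
        d = d0 + d1
    return a, d


# Toy panel (made up for illustration).
panel = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
]
a, d = build_pbwt(panel)
print(a, d)
```

Because both arrays are rebuilt by a full pass over the panel at every site, adding or removing a single haplotype would require recomputing them, which is the limitation the abstract attributes to the static PBWT.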
Affiliation(s)
- Ahsan Sanaullah
- Department of Computer Science, University of Central Florida, Orlando, FL, USA
- Degui Zhi
- Center for Precision Health, School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Shaojie Zhang
- Department of Computer Science, University of Central Florida, Orlando, FL, USA
5
Naseri A, Zhi D, Zhang S. Discovery of runs-of-homozygosity diplotype clusters and their associations with diseases in UK Biobank. medRxiv 2020:2020.10.26.20220004. [PMID: 33140058 PMCID: PMC7605569 DOI: 10.1101/2020.10.26.20220004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/25/2022]
Abstract
Runs of homozygosity (ROH) segments, contiguous homozygous regions in a genome, were traditionally linked to families and inbred populations. However, a growing literature suggests that ROHs are ubiquitous in outbred populations. Still, most existing genetic studies of ROH in populations are limited to aggregated ROH content across the genome, which does not offer the resolution for mapping causal loci. This limitation is mainly due to a lack of methods for efficient identification of shared ROH diplotypes. Here, we present a new method, ROH-DICE, to find large ROH diplotype clusters, that is, sufficiently long ROHs shared by a sufficient number of individuals, in large cohorts. ROH-DICE identified over 1 million ROH diplotypes that span over 100 SNPs and are shared by more than 100 UK Biobank participants. Moreover, we found significant associations of clustered ROH diplotypes across the genome with various self-reported diseases, with the strongest associations found between the extended HLA region and autoimmune disorders. We found an association between a diplotype covering the HFE gene and haemochromatosis, even though the well-known causal SNP was not directly genotyped or imputed. Using a genome-wide scan, we identified a putative association between carriers of an ROH diplotype in chromosome 4 and an increase in mortality among COVID-19 patients. In summary, our ROH-DICE method, by calling out large ROH diplotypes in a large outbred population, enables further population genetics research into the demographic history of large populations. More importantly, our method enables a new genome-wide mapping approach for finding disease-causing loci with multi-marker recessive effects at population scale.
Affiliation(s)
- Ardalan Naseri
- Department of Computer Science, University of Central Florida, Orlando, Florida 32816, USA
- Degui Zhi
- Center for Precision Health, School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas 77030, USA
- Shaojie Zhang
- Department of Computer Science, University of Central Florida, Orlando, Florida 32816, USA
6
Williams L, Mumey B. Maximal Perfect Haplotype Blocks with Wildcards. iScience 2020; 23:101149. [PMID: 32446220 PMCID: PMC7243190 DOI: 10.1016/j.isci.2020.101149] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 01/20/2020] [Revised: 04/27/2020] [Accepted: 05/05/2020] [Indexed: 11/30/2022]
Abstract
Recent work provides the first method to measure the relative fitness of genomic variants within a population that scales to large numbers of genomes. A key component of the computation involves finding maximal perfect haplotype blocks from a set of genomic samples for which SNPs (single-nucleotide polymorphisms) have been called. Often, owing to low read coverage and imperfect assemblies, some of the SNP calls can be missing from some of the samples. In this work, we consider the problem of finding maximal perfect haplotype blocks where some missing values may be present. Missing values are treated as wildcards, and the definition of maximal perfect haplotype blocks is extended in a natural way. We provide an output-linear time algorithm to identify all such blocks and demonstrate the algorithm on a large population SNP dataset. Our software is publicly available.
- Defined haplotype blocks to study SNP population data with missing values
- Developed a fast software tool to find these blocks
- Tested on a human chromosome 22 dataset of 5,008 samples and over one million SNPs
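As a rough illustration of what a perfect block with wildcards could look like, the Python sketch below checks one candidate block. It assumes one plausible reading of the extended definition: a wildcard matches any allele, and a block is perfect when, in every column, the non-wildcard characters of the selected rows agree. The paper's formal definition may differ, the `*` symbol and function name are hypothetical, and this brute-force check is not the output-linear algorithm the abstract describes.

```python
WILDCARD = "*"  # hypothetical symbol for a missing SNP call


def is_perfect_block(haplotypes, rows, start, end):
    """Return True if the selected rows agree on every column in [start, end),
    ignoring wildcards.

    haplotypes: list of equal-length strings over {'0', '1', '*'}.
    rows: indices of the haplotypes that form the candidate block.
    """
    for col in range(start, end):
        # Collect the distinct called alleles in this column.
        seen = {haplotypes[r][col] for r in rows} - {WILDCARD}
        if len(seen) > 1:
            return False
    return True


# Toy data (made up for illustration).
haps = ["01*0", "0110", "0*11"]
print(is_perfect_block(haps, [0, 1, 2], 0, 3))
print(is_perfect_block(haps, [0, 1, 2], 0, 4))
```

Checking maximality would additionally require that the block cannot be extended by one column on either side or by one more row without breaking this condition.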
Affiliation(s)
- Lucia Williams
- Gianforte School of Computing, Montana State University, Bozeman, MT 59717, USA
- Brendan Mumey
- Gianforte School of Computing, Montana State University, Bozeman, MT 59717, USA
7
Williams L, Mumey B. Extending Maximal Perfect Haplotype Blocks to the Realm of Pangenomics. Algorithms for Computational Biology 2020. [PMCID: PMC7197059 DOI: 10.1007/978-3-030-42266-0_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/04/2022]
Abstract
Recent work provides the first method to measure the relative fitness of genomic variants within a population that scales to large numbers of genomes. A key component of the computation involves finding conserved haplotype blocks, which can be done in linear time. Here, we extend the notion of conserved haplotype blocks to pangenomes, which can store more complex variation than a single reference genome. We define a maximal perfect pangenome haplotype block and give a linear-time, suffix tree based approach to find all such blocks from a set of pangenome haplotypes. We demonstrate the method by applying it to a pangenome built from yeast strains.